Urdu Nastalique Optical Character Recognition (OCR)

The technical resources of the Urdu Nastalique OCR project are divided into the following four teams:

Data Collection Team

The data collection team is responsible for developing and releasing the Urdu text and image corpora required for developing the sub-phases of the Urdu OCR system. The team gathers the corpus requirements, develops corpus collection and tagging guidelines, and defines and runs the complete process for developing each text or image corpus. Its main deliverables are an image corpus, a text corpus, textual and non-textual document images, document images with tagged text lines, tagged Latin and Nastalique document images, diacritic images, main-body images, synthesized high-frequency diacritic images, synthesized high-frequency main-body images, and a document image corpus with tagged diacritics, main bodies and ligatures.

Pre-processing Team

The pre-processing team is responsible for developing the pre-processing module of the Urdu Nastalique OCR. This team designs and develops the following sub-systems: noise removal, skew detection/rejection, page frame detection, page segmentation into text and non-text areas, segmentation of textual areas into columns, text lines, diacritics, main bodies and ligatures, Nastalique/Latin script detection, and run marking. The team also develops automatic accuracy computation systems that use tagged data to evaluate each sub-system of the pre-processing module.
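As an illustration of how automatic accuracy computation against tagged data might work for a segmentation sub-system, the following minimal sketch matches predicted text-line boxes to tagged boxes by intersection over union (IoU) and reports precision and recall. The box format, the function names `iou` and `segmentation_scores`, and the 0.5 threshold are assumptions made for this example, not the project's actual evaluation method.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (left, top, right, bottom)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def segmentation_scores(predicted, tagged, threshold=0.5):
    """Greedily match predicted boxes to tagged boxes; return (precision, recall).

    The threshold is a hypothetical choice for this sketch.
    """
    unmatched = list(tagged)
    hits = 0
    for box in predicted:
        # Pick the still-unmatched tagged box that overlaps this prediction most.
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            hits += 1
            unmatched.remove(best)  # each tagged box can be matched only once
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(tagged) if tagged else 0.0
    return precision, recall


# Hypothetical tagged and predicted text-line boxes for one page.
tagged_lines = [(0, 0, 100, 20), (0, 30, 100, 50)]
predicted_lines = [(0, 0, 100, 22), (0, 60, 100, 80)]
precision, recall = segmentation_scores(predicted_lines, tagged_lines)
```

The same matching idea could in principle be applied to columns, diacritics, main bodies or ligatures, provided the tagged data stores their bounding boxes.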

Classification and Recognition Team

This team is responsible for developing two classification and recognition systems for font sizes 14 to 44: (1) a ligature-based classification and recognition system, and (2) a segmentation-based classification and recognition system. Initially, the team designs and develops sub-classifiers for font sizes 14, 24 and 36 on high-frequency ligature classes. Once these sub-classifiers have matured, a font-size-independent system will be developed to recognize ligatures at any font size from 14 to 44. The team also develops automatic accuracy computation systems that use the tagged data of diacritics, main bodies and ligatures.
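A recognition accuracy computation over tagged ligatures could look roughly like the sketch below, which compares recognized class labels with tagged labels and reports overall and per-class accuracy. The label names and the function `ligature_accuracy` are hypothetical, and the project's own metric may differ.

```python
from collections import Counter


def ligature_accuracy(tagged_labels, predicted_labels):
    """Return (overall accuracy, per-class accuracy) for aligned label lists."""
    if len(tagged_labels) != len(predicted_labels):
        raise ValueError("tagged and predicted label lists must be the same length")
    total = Counter(tagged_labels)
    # Count correct predictions per tagged class.
    correct = Counter(t for t, p in zip(tagged_labels, predicted_labels) if t == p)
    overall = sum(correct.values()) / len(tagged_labels) if tagged_labels else 0.0
    per_class = {label: correct[label] / count for label, count in total.items()}
    return overall, per_class


# Hypothetical ligature class identifiers from a tagged test set.
tagged = ["LIG_001", "LIG_001", "LIG_002", "LIG_003"]
predicted = ["LIG_001", "LIG_002", "LIG_002", "LIG_003"]
overall, per_class = ligature_accuracy(tagged, predicted)
```

Running the same computation separately on test sets for font sizes 14, 24 and 36 would give one accuracy figure per sub-classifier.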

Post-processing Team

The post-processing team works on the design and development of the ligature-to-word mapping system for the Urdu Nastalique OCR. The team is responsible for analyzing the classification and recognition output in detail and for devising a word segmentation algorithm that generates the best sequence of words from the recognized sequence of ligatures. The team also defines and runs the process for cleaning and tagging the Urdu text corpus and the Urdu word list used by the word segmentation system. In addition, the team works on the design and development of an automatic accuracy computation system.
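One common way to pick the best word sequence from a ligature sequence is dynamic programming over a word list with word probabilities. The sketch below illustrates that general idea only; it is not the team's algorithm. The function `segment_ligatures`, the romanized placeholder ligatures, the probabilities, the `max_len` limit and the penalty for unknown ligatures are all assumptions made for the example.

```python
import math


def segment_ligatures(ligatures, word_probs, max_len=5, unknown_logp=-20.0):
    """Group a ligature sequence into the most probable word sequence.

    word_probs maps a word, written as a tuple of ligatures, to a unigram
    probability. A single ligature missing from the list is accepted as an
    unknown word with a fixed penalty, so a segmentation always exists.
    """
    n = len(ligatures)
    # best[i] = (best log-probability of the first i ligatures, start of last word)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = tuple(ligatures[start:end])
            if word in word_probs:
                logp = math.log(word_probs[word])
            elif end - start == 1:
                logp = unknown_logp  # fallback for an unlisted single ligature
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers to recover the word boundaries.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(tuple(ligatures[start:end]))
        end = start
    words.reverse()
    return words


# Romanized placeholders standing in for recognized Nastalique ligatures.
word_probs = {
    ("pa", "kista", "n"): 0.4,
    ("pa",): 0.1,
    ("zinda",): 0.3,
    ("ba", "d"): 0.2,
}
recognized = ["pa", "kista", "n", "zinda", "ba", "d"]
words = segment_ligatures(recognized, word_probs)
```

In a real system the probabilities would come from the cleaned and tagged text corpus, and a bigram or higher-order language model could replace the unigram scores.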