Urdu Nastalique Optical Character Recognition (OCR)

Access to online information has now become a critical factor in the development of nations globally. Much of local content is available in published form in local languages. Further much of the content being made available online in Urdu is in the form of images, due to the legacy technology being used by the publishing industry, which makes it very bulky to transfer over the low bandwidth connections and renders it unsearchable. With growing access to internet through broad band and mobile, there is increasing need to port relevant content in local languages online to more effectively use the data channels for the benefit of public in Pakistan. Technologies like the Urdu Optical character recognition (OCR) are necessary to drive this change.

Urdu Nastalique Optical Character Recognition (OCR) project processes Urdu document images written in Noori Nastalique writing style having font size range from 14 to 44 and outputs Urdu text in Unicode format. This project will accelerate the process of publishing Urdu online content and make published Urdu material more accessible to illiterate and blind community.

The specific objectives of Urdu OCR are

To develop and mature algorithms for analyzing and recognizing Urdu text images based on segmentation-based and ligature-based methods
To develop automatic scaling algorithms for Urdu ligatures to make font size independent system
To develop Urdu OCR for Nastalique style of writing
To develop post-processing algorithms in computational linguistics for output generation and error correction of Urdu OCR
To identify future research directions for graduate research in this area
To develop capacity in the area of Human Language Technology
To create and release an Urdu text image corpus with open license for further development and testing of OCR for Urdu and other Pakistani languages by other interested research organizations and universities
To provide access to textual information to print disable communities