Funded by:


Scope of Work

Urdu Nastalique OCR project focuses on the following scope of work.

  1. The following character set will be recognized by the Urdu OCR:
    1. Urdu alphabet given in Figure 1 below
    2. Latin digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
    3. Urdu digits ( ۰ , ۱ , ۲ , ۳ , ۴ , ۵ , ۶ , ۷ , ۸ , ۹ )
    4. Urdu aerab (ً َ ُ ِ ّٰ ٓ ٔ )
    5. Other symbols of Urdu, as follows:

          ( ؁   ؀   ؂   ؃ ؎ ؏ ؐ ؑ    ؒ  ؓ  ؔ ؟ () ' " ۔ ؛    : ، )

     Alphabet.JPG
    Figure 1. Urdu character set

  2. The text written in Noori Nastalique font style with font size range between14 and 44 will be recognized.  Smaller or larger font sizes will not be processed at this time.  Newspapers, normally published in smaller font sizes, will not be recognized by the OCR.
  3. This application will process plain text, and not process advanced formatting, e.g. Italic, bold, and underline, etc.
  4. The system will handle up to 2 columns of text.
  5. Urdu OCR will identify the Latin script written with Times New Roman, Arial and Courier font styles, within the font size range proposed for Urdu.  The Latin script will be identified by current project and will be forwarded to the existing Tesseract (open source) OCR system. The speed and accuracy rate of recognizing Latin script will depend on Tesseract.  Beyond this, Latin script recognition will not be processed in the current project.
  6. The system will handle salt and pepper noise. It will also detect page frame (text written in the page is called page frame).
  7. The system will also detect skew in the page. An image will be rejected if skew is present in the page. 
  8. The system will output plain Urdu text in Unicode format.

More work, which includes formatting output documents, recognizing advanced formatting features in the input text, extending the Nastalique OCR to other Pakistani languages, etc. will be incorporated in follow up projects, after the base framework for Urdu Nastalique has been developed.  The current functionality being proposed is sufficient to port Urdu published data online and for developing applications like book readers.

Executed by:








Copyrights: Center for Language Engineering, 2014
webmaster@cle.org.pk