CLE is making these linguistic resources available without cost to support academic, non-commercial research. The processing fees charged are used to maintain these resources. Please contact CLE directly about discounts (available only to selected public organizations in Pakistan) or commercial licensing options.

CLE Urdu Image Corpus 20 Point Size

CLE Catalog #: CLE14I029
Release Date: 30 October 2014
Data Type: Image
Language(s): Urdu
Distribution: 1 DVD, Web Download
Processing Fee (Pakistan): 30000 PKR
Processing Fee (International): 250 USD
License: Yes

Introduction

The CLE Urdu Image Corpus for 20 Point Size is an image corpus collected from forty-five books printed in the Noori Nastalique writing style. The selected books cover the Urdu character set, Urdu digits, Latin digits, English characters, Urdu aerab and Urdu special symbols. The selection also ensures variety in publisher, publication date, paper quality, print quality and transparency. Where available, the books are drawn from different domains, such as literature, poetry, religion, biography, novel, interviews, culture/travel, history, autobiography, science, short stories and character representation. Pages from each book are scanned at 300 DPI. Both grayscale and binary formats (where possible) of one hundred and forty-nine (149) images have been generated. Edited (cropped) and unedited versions of each format are maintained, to serve both researchers working on page frame detection in Urdu document images and researchers who only want to process the images. Complete information about each scanned page (page number, domain, print quality, paper quality, transparency, etc.) is maintained in a separate file. A typed version of the images is also available as the CLE Urdu Text Corpus 20 Point Size.

Data Source

The CLE Urdu Image Corpus for 20 Point Size is collected from forty-five Urdu books. Information about the books is maintained in a separate file, which is distributed along with the corpus.

Data

This corpus contains forty-five folders, one for each book. Each folder is assigned a unique identifier, the book ID. Each book folder contains four subfolders, described below. In every image name, the document type of the page is R (normal text), I (image) or T (table of contents).
- BW_E_C_book-ID: edited (cropped) binary versions of the scanned images, in JPEG or BITMAP format. Each image name is labeled as follows:
  BW(Black and White)_E(Edited)_C(Cropped)_B<Book ID>_<Document type>_P<page number>_F<font size>
- BW_UE_book-ID: unedited binary versions of the scanned images, in JPEG or BITMAP format. Each image name is labeled as follows:
  BW(Black and White)_UE(UnEdited)_B<Book ID>_<Document type>_P<page number>_F<font size>
- G_E_C_book-ID: edited (cropped) grayscale versions of the scanned images, in JPEG or BITMAP format. Each image name is labeled as follows:
  G(Grayscale)_E(Edited)_C(Cropped)_B<Book ID>_<Document type>_P<page number>_F<font size>
- G_UE_book-ID: unedited grayscale versions of the scanned images, in JPEG or BITMAP format. Each image name is labeled as follows:
  G(Grayscale)_UE(UnEdited)_B<Book ID>_<Document type>_P<page number>_F<font size>
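
The naming convention above could be parsed along the following lines. This is a minimal sketch, not a tool shipped with the corpus: the function name `parse_image_name`, the `ImageName` record and the example file names are hypothetical, and the sketch assumes that the book ID, page number and font size contain no underscores and that files may carry an extension such as .jpg or .bmp. The exact ID and page-number formats are not stated here, so the fields are kept as strings.

```python
import re
from typing import NamedTuple, Optional


class ImageName(NamedTuple):
    color: str      # "BW" (binary) or "G" (grayscale)
    edited: bool    # True for E_C (edited, cropped), False for UE (unedited)
    book_id: str
    doc_type: str   # "R", "I" or "T"
    page: str
    font_size: str


# Document type codes as described in the corpus documentation.
DOC_TYPES = {"R": "normal text", "I": "image", "T": "table of contents"}

# Assumption: field values contain no underscores; the extension is optional.
_PATTERN = re.compile(
    r"^(?P<color>BW|G)_(?P<variant>E_C|UE)_B(?P<book>[^_]+)_(?P<doc>[RIT])"
    r"_P(?P<page>[^_]+)_F(?P<font>[^_.]+)(?:\.[A-Za-z0-9]+)?$"
)


def parse_image_name(name: str) -> Optional[ImageName]:
    """Split an image file name into its labeled parts, or return None."""
    m = _PATTERN.match(name)
    if m is None:
        return None
    return ImageName(
        color=m["color"],
        edited=(m["variant"] == "E_C"),
        book_id=m["book"],
        doc_type=m["doc"],
        page=m["page"],
        font_size=m["font"],
    )


# Hypothetical file name, for illustration only.
info = parse_image_name("BW_E_C_B12_R_P34_F20.jpg")
if info is not None:
    print(info.book_id, DOC_TYPES[info.doc_type], info.page)
```

Names that do not follow the pattern return None rather than raising an error, so a folder can be scanned without stopping at unrelated files.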

Samples

- Binary Cropped Image
- Binary Unedited Image
- Grayscale Cropped Image
- Grayscale Unedited Image