CLE makes these linguistic resources available without cost to support academic, non-commercial research. The processing fees charged are used to maintain these resources. Please contact CLE directly about discounts (available only to selected public organizations in Pakistan) or about commercial licensing options.

CLE Urdu 14 Point Size Diacritic Instance Images

Source: CLE Urdu Image Corpus for 14 Point Size and CLE Urdu Image Corpora for 24 - 40 Point Sizes
CLE Catalog #: CLE14I023
Release Date: 8 May 2014
Data Type: Image
Language(s): Urdu and English
Distribution: 1 DVD, Web Download
Processing Fee (Pakistan): 30000 PKR
Processing Fee (International): 250 USD
License: Yes

Introduction

CLE Urdu 14 Point Size Diacritic Instance Images is an image corpus collected from three hundred and forty-seven pages scanned from one hundred and fifty-three books written using Noori Nastalique. The books vary in publisher, publication date, paper, printing and transparency quality, and are selected from different domains such as literature, poetry, religion, biography, novel, interviews, culture/travel, history, autobiography, science, short stories and character representation. Up to eleven pages from each book are scanned at 300 DPI to generate the image corpus.
This image corpus is processed to extract diacritics, including Ijam, Tashkil, punctuation marks and special symbols of Urdu. The diacritics are grouped into twenty classes, each distributed in a separate folder. Each folder contains up to one hundred samples; in a few cases the sample count is lower. Diacritics with similar shapes are merged into the same diacritic class, e.g. the single dots of ‘ب’ and ‘:’.
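As a rough illustration of the layout described above, the following sketch counts the images in each class folder and lists the folders holding fewer than one hundred samples. It assumes that the corpus has been copied to a local directory whose immediate subfolders are the diacritic classes, and that the images carry a .bmp extension; the function names and the directory argument are hypothetical and are not part of the distribution.

```python
from pathlib import Path
from typing import Dict


def count_samples(corpus_root: str) -> Dict[str, int]:
    """Map each class folder name to its number of .bmp files.

    Assumption: every immediate subfolder of corpus_root is one
    diacritic class; files with other extensions are ignored.
    """
    counts: Dict[str, int] = {}
    for folder in sorted(p for p in Path(corpus_root).iterdir() if p.is_dir()):
        counts[folder.name] = sum(
            1
            for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() == ".bmp"
        )
    return counts


def under_filled(counts: Dict[str, int], expected: int = 100) -> Dict[str, int]:
    """Return only the classes holding fewer than `expected` samples."""
    return {name: n for name, n in counts.items() if n < expected}
```

Calling under_filled(count_samples(path)) on the unpacked corpus would then show which classes fall short of one hundred samples.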

Data Source

CLE Urdu 14 Point Size Diacritic Instance Images is extracted from:
- CLE12I001 CLE Urdu Image Corpus 14 Point Size (for details, see: http://www.cle.org.pk/clestore/cleurduimagecorpus14pt.htm).
- CLE Urdu Image Corpora for 24 - 40 Point Sizes (for details, see: http://www.cle.org.pk/clestore/imagecorpora.htm).

Data

This corpus contains 20 folders of images in bitmap (BMP) format. Each image file is named as follows:
B(Binarized)_<DiacriticClass>_<SampleNumber>_F<FontSize>.bmp
A separate file, distributed along with the corpus, lists each folder's name and the number of instances of the respective diacritic class.
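The naming scheme above can be split into its fields with a short script. The sketch below is only an illustration: it assumes that file names begin with a literal "B" (for binarized), that the fields are separated by underscores, and that the sample number and font size are plain integers. The helper names and the example file name are hypothetical and should be checked against the actual files.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: B_<DiacriticClass>_<SampleNumber>_F<FontSize>.bmp
# The class name may itself contain underscores, so it is matched
# greedily and the two numeric fields are anchored to the end.
_NAME_RE = re.compile(
    r"^B_(?P<cls>.+)_(?P<sample>\d+)_F(?P<size>\d+)\.bmp$",
    re.IGNORECASE,
)


class InstanceName(NamedTuple):
    diacritic_class: str
    sample_number: int
    font_size: int


def parse_instance_name(filename: str) -> Optional[InstanceName]:
    """Split a file name into its fields, or return None if it does not fit."""
    match = _NAME_RE.match(filename)
    if match is None:
        return None
    return InstanceName(
        diacritic_class=match.group("cls"),
        sample_number=int(match.group("sample")),
        font_size=int(match.group("size")),
    )
```

For a hypothetical file named B_SINGLE_DOT_12_F14.bmp, this would yield the class SINGLE_DOT, sample number 12 and font size 14. Names that do not fit the assumed pattern return None instead of raising an error, so unexpected files can be listed and inspected.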

Samples

Instance Image of ‘SINGLE DOT’
Instance Image of ‘DOUBLE DOT’
Instance Image of ‘MADDAH’
Instance Image of ‘SECONDARY STROKE OF GAAF’