CLE makes these linguistic resources available without cost to support academic, non-commercial research. The processing fees charged are used to maintain the resources. Please contact CLE directly about discounts (available only to selected public organizations in Pakistan) or commercial licensing options.
CLE Urdu HFL 16 Point Size (5586 classes) Document Images
Source: CLE Urdu Digest Corpus 100K, CLE Urdu Text Corpora for 14 - 40 Point Sizes and 18 million words corpus crawled from online news websites
CLE Catalog #: CLE14I008
Release Date: 9 January 2014
Data Type: Image
Language(s): Urdu
Distribution: 1 DVD, Web Download
Processing Fee (Pakistan): 30000 PKR
Processing Fee (International): 250 USD
License: Yes
Introduction
CLE Urdu HFL 16 Point Size (5586 classes) Document Images is an image corpus of high frequency Urdu ligatures. The 5,586 high frequency ligature classes were extracted from the text corpora listed in the Data Source section. These ligature classes cover 131,000 high frequency words (for details see: http://www.cle.org.pk/software/ling_resources/wordlist.htm).
Each image in the corpus contains at least thirty samples of one ligature class, written in the Noori Nastalique style at 16-point font size and scanned at 300 DPI in grayscale.
Data Source
The 5,586 high frequency Urdu ligature classes are extracted from:
- CLE12T001 CLE Urdu Digest Corpus 100K, covering domains such as education, health, politics, international affairs, sports, business, humor and literature (for details see: http://www.cle.org.pk/clestore/urdudigestcorpus100k.htm).
- CLE Urdu Text Corpora for 14 to 40 Point Sizes (for details see: http://cle.org.pk/clestore/index.htm).
- An 18-million-word corpus crawled from online news websites, covering a wide range of domains including sports, news, finance and culture (for details see: http://www.cle.org.pk/Publication/papers/2007/corpus_based_urdu_lexicon_development.pdf).
Data
This corpus contains 5,586 images in JPEG or BITMAP format. Each image file is named according to the following pattern:
G(Grayscale)_HFL(HighFrequencyLigature)_<LigatureClass>_<HFLSerialNumber>_F<FontSize>
Here G denotes grayscale and HFL denotes high frequency ligature. A separate file, distributed along with the corpus, records for each image its ligature class, image name, printed ligature and high frequency ligature serial number.
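As an illustration of the naming scheme, the following Python sketch splits an image name into its fields. It rests on an assumption this page does not confirm: that the parenthesized words in the pattern are explanatory only, so that actual names have the literal form G_HFL_<LigatureClass>_<HFLSerialNumber>_F<FontSize>. The function name parse_image_name, the HflImageName type and the example file name are hypothetical and should be checked against the distributed files.

```python
import os
import re
from typing import NamedTuple, Optional


class HflImageName(NamedTuple):
    """Fields recovered from one image name (hypothetical helper type)."""
    ligature_class: str
    serial_number: int
    font_size: int


# ASSUMPTION: literal names look like G_HFL_<LigatureClass>_<HFLSerialNumber>_F<FontSize>,
# i.e. "(Grayscale)" and "(HighFrequencyLigature)" in the catalog pattern are glosses.
_NAME_RE = re.compile(r"^G_HFL_(?P<cls>.+)_(?P<serial>\d+)_F(?P<size>\d+)$")


def parse_image_name(filename: str) -> Optional[HflImageName]:
    """Return the parsed fields, or None if the name does not fit the assumed pattern."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    match = _NAME_RE.match(stem)
    if match is None:
        return None
    return HflImageName(
        ligature_class=match.group("cls"),
        serial_number=int(match.group("serial")),
        font_size=int(match.group("size")),
    )


if __name__ == "__main__":
    # Hypothetical file name, made up for illustration only.
    print(parse_image_name("G_HFL_مجھے_12_F16.jpg"))
```

Returning None for names that do not match, rather than raising, makes it easy to list the files for which the assumed pattern is wrong before relying on the parsed fields.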
Samples
Grayscale Image of "جبکہ" Ligature Class
Grayscale Image of "کبجببے" Ligature Class
Grayscale Image of "مجھے" Ligature Class
Grayscale Image of "فبصد" Ligature Class