Center for Language Engineering

 
 



 

 



 
 


 
 


 
   
 

CLE makes these linguistic resources available at no cost to support academic, non-commercial research. The processing fees charged are used to maintain the resources. Please contact CLE directly about discounts (available only to selected public organizations in Pakistan) or commercial licensing options.

 
     
  CLE Urdu HFL 18 Point Size (5586 classes) Document Images
   
 
Source: CLE Urdu Digest Corpus 100K, CLE Urdu Text Corpora for 14 - 40 Point Sizes, and an 18-million-word corpus crawled from online news websites
CLE Catalog #: CLE13I005
Release Date: 26 December 2013
Data Type: Image
Language(s): Urdu
Distribution: 1 DVD, Web Download
Processing Fee (Pakistan): 30000 PKR
Processing Fee (International): 250 USD
License: Yes
   
  Introduction
  CLE Urdu HFL 18 Point Size (5586 classes) Document Images is an image corpus of high-frequency Urdu ligatures. The 5,586 high-frequency ligature classes were extracted from the text corpora listed in the Data Source section. These ligature classes cover 131,000 high-frequency words (for details see: http://www.cle.org.pk/software/ling_resources/wordlist.htm). Each image in the corpus contains at least thirty samples of one ligature class, printed in the Noori Nastalique writing style at 18-point font size and scanned at 300 DPI in grayscale.
   
  Data Source
  The 5,586 high-frequency Urdu ligature classes are extracted from:

  1. CLE12T001 CLE Urdu Digest Corpus 100K, covering domains such as education, health, politics, international affairs, sports, business, humor and literature (for details see: http://www.cle.org.pk/clestore/urdudigestcorpus100k.htm).
  2. CLE Urdu Text Corpora for 14 to 40 Point Sizes (for details see: http://cle.org.pk/clestore/index.htm).
  3. An 18-million-word corpus crawled from online news websites, covering a wide range of domains including sports, news, finance and culture (for details see: http://www.cle.org.pk/Publication/papers/2007/corpus_based_urdu_lexicon_development.pdf).
   
  Data
  This corpus contains 5,586 images in JPEG or BITMAP format. Each image file is named as follows:
G(Grayscale)_HFL(HighFrequencyLigature)_<LigatureClass>_<HFLSerialNumber>_F<FontSize>
A separate file, distributed with the corpus, records the following for each image: ligature class, image name, printed ligature and high-frequency ligature serial number.
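The naming convention above can be split back into its fields programmatically. The following is a minimal Python sketch, not part of the CLE distribution. It assumes that the parenthesized words in the pattern are explanatory, so that a concrete file stem looks like G_HFL_<LigatureClass>_<HFLSerialNumber>_F<FontSize>, and that the serial number and font size are plain decimal integers. The function name `parse_hfl_image_name` and the `HflImageName` type are hypothetical helpers introduced here for illustration; check them against the actual file names and the accompanying information file before relying on them.

```python
import re
from pathlib import PurePath
from typing import NamedTuple, Optional

# Assumed concrete form of a file stem (an interpretation of the documented pattern):
#   G_HFL_<LigatureClass>_<HFLSerialNumber>_F<FontSize>
# The serial number and font size are assumed to be decimal integers.
_NAME_RE = re.compile(
    r"^G_HFL_(?P<ligature_class>.+)_(?P<serial>\d+)_F(?P<font_size>\d+)$"
)


class HflImageName(NamedTuple):
    """Fields recovered from one image file name (hypothetical helper type)."""
    ligature_class: str
    serial: int
    font_size: int


def parse_hfl_image_name(filename: str) -> Optional[HflImageName]:
    """Return the parsed fields, or None if the name does not fit the assumed pattern."""
    stem = PurePath(filename).stem  # drop the directory and the file extension
    match = _NAME_RE.match(stem)
    if match is None:
        return None
    return HflImageName(
        ligature_class=match.group("ligature_class"),
        serial=int(match.group("serial")),
        font_size=int(match.group("font_size")),
    )
```

Returning None for names that do not match, rather than raising, makes it easy to scan a directory and collect any files that deviate from the convention for manual inspection.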
   
  Samples
 
 
Grayscale Image of "جبکہ" Ligature Class
Grayscale Image of "کبجببے" Ligature Class
Grayscale Image of "مجھے" Ligature Class
Grayscale Image of "فبصد" Ligature Class
   
   
   
 
 
 
 

webmaster@cle.org.pk