CLE Store

Center for Language Engineering


CLE makes these linguistic resources available at no cost to support academic, non-commercial research. The processing fees charged are used to maintain the resources. Please contact CLE directly about discounts (available only to selected public organizations in Pakistan) or commercial licensing options.

CLE Urdu HFL 16 Point Size (5586 classes) Document Images


Source:	CLE Urdu Digest Corpus 100K, CLE Urdu Text Corpora for 14 - 40 Point Sizes, and an 18-million-word corpus crawled from online news websites
CLE Catalog #:	CLE14I008
Release Date:	9 January 2014
Data Type:	Image
Language(s):	Urdu
Distribution:	1 DVD, Web Download
Processing Fee (Pakistan):	30000 PKR
Processing Fee (International):	250 USD
License:	Yes

Introduction

CLE Urdu HFL 16 Point Size (5586 classes) Document Images is an image corpus of high-frequency Urdu ligatures. The 5586 high-frequency ligature classes were extracted from the text corpora listed in the Data Source section. These ligature classes cover 131,000 high-frequency words (for details see: http://www.cle.org.pk/software/ling_resources/wordlist.htm). Each image in the corpus contains at least thirty samples of its ligature class, written in the Noori Nastalique writing style at 16-point font size and scanned at 300 DPI in grayscale.
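As a rough guide to the scale of the scanned text, the nominal font size can be converted from points to pixels. The sketch below assumes the standard typographic point of 1/72 inch; the function name is illustrative and not part of the corpus. The result is the nominal em size only, and the actual height of an individual Nastalique ligature in a scan will differ.

```python
POINTS_PER_INCH = 72  # assumption: standard typographic point (1/72 inch)


def points_to_pixels(point_size: float, dpi: float) -> float:
    """Convert a nominal font size in points to pixels at a given scan resolution."""
    return point_size / POINTS_PER_INCH * dpi


# 16-point text scanned at 300 DPI, as described for this corpus
print(round(points_to_pixels(16, 300), 1))  # 66.7
```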

Data Source

The 5586 high-frequency Urdu ligature classes are extracted from:

CLE12T001 CLE Urdu Digest Corpus 100K, covering domains such as education, health, politics, international affairs, sports, business, humor and literature (for details see: http://www.cle.org.pk/clestore/urdudigestcorpus100k.htm).
CLE Urdu Text Corpora for 14 to 40 Point Sizes (for details see: http://cle.org.pk/clestore/index.htm).
An 18-million-word corpus crawled from online news websites, covering a wide range of domains including sports, news, finance and culture (for details see: http://www.cle.org.pk/Publication/papers/2007/corpus_based_urdu_lexicon_development.pdf).

Data

This corpus contains 5586 images in JPEG or BITMAP format. Each image is named as follows:

G(Grayscale)_HFL(HighFrequencyLigature)_<LigatureClass>_<HFLSerialNumber>_F<FontSize>

A separate file, distributed along with the corpus, records information about each image: its ligature class, image name, printed ligature and high-frequency ligature serial number.
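The naming pattern above can be split back into its fields programmatically. The following is a minimal sketch, not a tool shipped with the corpus. It assumes that the parenthesised words "(Grayscale)" and "(HighFrequencyLigature)" are explanations rather than part of the file name, so that names on disk look like G_HFL_<LigatureClass>_<HFLSerialNumber>_F<FontSize> followed by a file extension, and that the serial number and font size are plain digits. The names `ImageName` and `parse_image_name`, and the example file name, are hypothetical.

```python
import re
from dataclasses import dataclass
from pathlib import PurePath

# Assumed on-disk form: G_HFL_<LigatureClass>_<HFLSerialNumber>_F<FontSize>
# The greedy ".+" allows a ligature class that itself contains underscores.
_NAME_RE = re.compile(
    r"G_HFL_(?P<ligature_class>.+)_(?P<serial>\d+)_F(?P<font_size>\d+)"
)


@dataclass(frozen=True)
class ImageName:
    ligature_class: str
    serial: int
    font_size: int


def parse_image_name(filename: str) -> ImageName:
    """Split an image file name into ligature class, HFL serial number and font size."""
    stem = PurePath(filename).stem  # drop the extension (.jpg, .bmp, ...)
    match = _NAME_RE.fullmatch(stem)
    if match is None:
        raise ValueError(f"not a recognised corpus image name: {filename!r}")
    return ImageName(
        ligature_class=match["ligature_class"],
        serial=int(match["serial"]),
        font_size=int(match["font_size"]),
    )


# Hypothetical file name; the serial number 12 is made up for illustration.
info = parse_image_name("G_HFL_جبکہ_12_F16.jpg")
print(info.ligature_class, info.serial, info.font_size)
```

If the real file names differ from this assumed form, only the regular expression should need adjusting. The information file distributed with the corpus remains the authoritative mapping between image names and ligature classes.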

Samples


Grayscale Image of "جبکہ" Ligature Class	Grayscale Image of "کبجببے" Ligature Class	Grayscale Image of "مجھے" Ligature Class	Grayscale Image of "فبصد" Ligature Class

webmaster@cle.org.pk