Center for Language Engineering

 
 



 

 



 
 


 
 


 
   
 

CLE makes these linguistic resources available without cost to support academic, non-commercial research. The processing fees charged are used to maintain these resources. Please contact CLE directly about discounts (available only to selected public organizations in Pakistan) or commercial licensing options.

 
     
  CLE Urdu Books N-Grams
   
 
Source: Urdu Books
CLE Catalog #: CLE14L002
Release Date: 20 September 2014
Data Type: Text
Language(s): Urdu
Distribution: 1 DVD, Web Download
Processing Fee (Pakistan): 30000 PKR
Processing Fee (International): 250 USD
License: Yes
Citation: Farah Adeeba, Qurat-ul-ain Akram, Hina Khalid, and Sarmad Hussain, "CLE Urdu Books N-grams", poster presentation at the Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan.
   
  Introduction
  CLE Urdu Books N-grams is a 37-million-word collection extracted from 861 Urdu books covering a wide range of domains, including articles, biography, character representation, culture, foreign literature, health, history, interviews, letters, magazines, novels, plays, religion, reviews, science, short stories, travel and Urdu literature. The extracted N-grams include unigrams, bigrams and trigrams. They can be used in any statistical Urdu Natural Language Processing (NLP) or Information Retrieval application. For details of the N-gram extraction process, see: http://www.cle.org.pk/Publication/papers/2014/CLE%20Urdu%20Books%20N-grams.pdf
   
  Data Source
  Urdu books crawled from Urdu Library were used to generate the CLE Urdu Books N-grams.
   
  Data
  The data is distributed in three UTF-8 files, i.e. cleurdubooks-1gram.txt, cleurdubook-2gram.txt and cleurdubooks-3gram.txt. In each distributed file, the first column contains the N-gram count and the second column contains the N-gram entry.
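  The two-column layout described above can be read with a few lines of Python. This is a minimal sketch, not the distributors' own tooling: the function names `parse_ngram_lines` and `load_ngrams` are hypothetical, and the column separator is not stated on this page, so the code assumes the count and the entry are separated by whitespace (a tab or spaces).

```python
import io
from typing import Dict, Iterable, Tuple

NGram = Tuple[str, ...]


def parse_ngram_lines(lines: Iterable[str]) -> Dict[NGram, int]:
    """Map each N-gram (a tuple of tokens) to its count.

    Assumption: every line holds the count first, then the N-gram entry,
    separated by whitespace. The actual delimiter is not documented here.
    """
    counts: Dict[NGram, int] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split once, on the first whitespace run
        if len(parts) != 2:
            continue  # skip lines that lack either a count or an entry
        count_text, entry = parts
        counts[tuple(entry.split())] = int(count_text)
    return counts


def load_ngrams(path: str) -> Dict[NGram, int]:
    """Read one of the distributed files; utf-8-sig tolerates an optional BOM."""
    with open(path, encoding="utf-8-sig") as handle:
        return parse_ngram_lines(handle)


# Three lines in the assumed layout, using counts from the sample on this page.
sample = io.StringIO("667561\tاس\n18074\tرہا تھا\n2236\tاس وقت تک\n")
counts = parse_ngram_lines(sample)
```

  Keying the dictionary by a tuple of tokens lets unigrams, bigrams and trigrams share one lookup structure, e.g. `counts[("رہا", "تھا")]`.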
 
  Sample
  The following representations give examples of unigram, bigram and trigram data.
 

  Unigrams
  Count	N-gram
  667561	اس
  512887	کا
  497054	کہ
  488531	کو
  470545	نے
  357980	ہیں
  352518	کر
  322267	ہو
  308311	یہ

  Bigrams
  Count	N-gram
  18167	نہ ہو
  18093	ہے جس
  18074	رہا تھا
  17874	کی طرح
  17860	سے روایت
  17778	اس طرح
  17764	میں ایک
  17682	رہے ہیں
  17637	اس وقت

  Trigrams
  Count	N-gram
  2249	کی راہ میں
  2236	اس وقت تک
  2229	اس بات کی
  2221	اور ان کو
  2218	، اس لیے
  2209	سے لے کر
  2204	ہے اور ان
  2203	ہے اور نہ
  2200	نے اس سے
 
 
 

webmaster@cle.org.pk