CLE makes these linguistic resources available at no cost to support academic, non-commercial research. The processing fees charged are used to maintain the resources. Please contact CLE directly about discounts (available only to selected public organizations in Pakistan) or about commercial licensing options.
|
|
|
|
|
|
CLE Urdu Books N-Grams |
|
|
|
|
Source: | Urdu Books |
CLE Catalog #: | CLE14L002 |
Release Date: | 20 September 2014 |
Data Type: | Text |
Language(s): | Urdu |
Distribution: | 1 DVD, Web Download |
Processing Fee (Pakistan): | 30000 PKR |
Processing Fee (International): | 250 USD |
License: | Yes |
Citation: | Farah Adeeba, Qurat-ul-ain Akram, Hina Khalid, and Sarmad Hussain "CLE Urdu Books N-grams", poster presentation in Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. |
|
|
|
|
Introduction |
|
CLE Urdu Books N-grams is a 37-million-word collection extracted from 861 Urdu books covering a wide range of domains, including articles, biography, character representation, culture, foreign literature, health, history, interviews, letters, magazines, novels, plays, religion, reviews, science, short stories, travel and Urdu literature. The extracted N-grams include unigrams, bigrams and trigrams. They can be used in any statistical Urdu Natural Language Processing (NLP) or Information Retrieval application. For details of the N-gram extraction process, see: http://www.cle.org.pk/Publication/papers/2014/CLE%20Urdu%20Books%20N-grams.pdf |
|
|
|
Data Source |
|
CLE Urdu Books N-grams were generated from Urdu books crawled from Urdu Library. |
|
|
|
Data |
|
The data is distributed in three UTF-8 files: cleurdubooks-1gram.txt, cleurdubook-2gram.txt and cleurdubooks-3gram.txt. In each file, the first column contains the N-gram count and the second column contains the N-gram entry. |
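The two-column layout described above can be read with a few lines of Python. This is a minimal sketch, not CLE's own tooling: the function names `read_ngrams` and `load_ngrams` are hypothetical, and the column separator is assumed to be whitespace (a tab or spaces), since this page does not state it. Because the sample on this page lists each entry before its count, the sketch also accepts lines where the count comes last.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_ngrams(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (ngram, count) pairs from the lines of one n-gram file."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Assumed layout: count first, then whitespace, then the n-gram.
        head = line.split(None, 1)
        if len(head) == 2 and head[0].isdecimal():
            yield head[1], int(head[0])
            continue
        # Fallback (assumption): n-gram first, count last.
        tail = line.rsplit(None, 1)
        if len(tail) == 2 and tail[1].isdecimal():
            yield tail[0], int(tail[1])
        # Lines that match neither layout are skipped silently.


def load_ngrams(path: str) -> Dict[str, int]:
    """Load one n-gram file into a dictionary mapping n-gram to count."""
    # "utf-8-sig" reads plain UTF-8 and also drops a leading byte-order mark.
    with open(path, encoding="utf-8-sig") as handle:
        return dict(read_ngrams(handle))
```

For example, `load_ngrams("cleurdubook-2gram.txt")` would return the bigram counts as a dictionary, provided the file follows one of the two assumed layouts. Splitting only once keeps the spaces inside bigram and trigram entries intact.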
|
|
|
Sample |
|
The following samples show unigram, bigram and trigram data, in that order. |
|
Unigram | Count |
اس | 667561 |
کا | 512887 |
کہ | 497054 |
کو | 488531 |
نے | 470545 |
ہیں | 357980 |
کر | 352518 |
ہو | 322267 |
یہ | 308311 |
|
Bigram | Count |
نہ ہو | 18167 |
ہے جس | 18093 |
رہا تھا | 18074 |
کی طرح | 17874 |
سے روایت | 17860 |
اس طرح | 17778 |
میں ایک | 17764 |
رہے ہیں | 17682 |
اس وقت | 17637 |
|
Trigram | Count |
کی راہ میں | 2249 |
اس وقت تک | 2236 |
اس بات کی | 2229 |
اور ان کو | 2221 |
، اس لیے | 2218 |
سے لے کر | 2209 |
ہے اور ان | 2204 |
ہے اور نہ | 2203 |
نے اس سے | 2200 |
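As an illustration of how such counts feed a statistical language model, the sketch below computes unsmoothed maximum-likelihood conditional probabilities from three counts that appear in the sample: the unigram اس, the bigram اس وقت and the trigram اس وقت تک. The helper name `mle_conditional` is hypothetical, and a real application would normally add smoothing for unseen N-grams, which is not shown here.

```python
def mle_conditional(ngram_count: int, context_count: int) -> float:
    """Maximum-likelihood estimate: count(context + word) / count(context)."""
    if context_count <= 0:
        raise ValueError("context count must be positive")
    return ngram_count / context_count


# Counts copied from the sample on this page.
unigrams = {"اس": 667561}
bigrams = {"اس وقت": 17637}
trigrams = {"اس وقت تک": 2236}

# P(وقت | اس): the bigram count divided by the count of its first word.
p_bigram = mle_conditional(bigrams["اس وقت"], unigrams["اس"])

# P(تک | اس وقت): the trigram count divided by the count of its two-word prefix.
p_trigram = mle_conditional(trigrams["اس وقت تک"], bigrams["اس وقت"])

print(f"P(second word | first word) = {p_bigram:.4f}")
print(f"P(third word | first two words) = {p_trigram:.4f}")
```

With these counts the bigram estimate is about 0.026 and the trigram estimate about 0.127.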