|
This wordlist comprises 5000 content words. Approximately 3000 of them were extracted from the
following three sources, and an additional 2000 words were added during the development of Urdu
WordNet, based on the initial 3000 words.
1. An 18-million-word corpus crawled from online news websites, covering a wide range of domains
including sports, news, finance, culture, etc. (For details see:
http://www.cle.org.pk/Publication/papers/2007/corpus_based_urdu_lexicon_development.pdf)
2. CLE12T001 CLE Urdu Digest Corpus 100K, covering domains such as education, health, politics,
international affairs, sports, business, humor and literature. (For details see:
http://www.cle.org.pk/clestore/urdudigestcorpus100k.htm)
3. Urdu Verb List extracted from Urdu Lughat. (For details see:
http://www.cle.org.pk/software/ling_resources/urduverblist.htm)
Selection of words in this wordlist is based on the following parameters:
1. Lexemes have been included; their inflectional forms have not
2. Closed form compound words have been included
3. Multiple correct spellings have been included
4. Foreign words are included in the list if they are listed in the Urdu Lughat (available at
OUD) or if they occur at least 20 times in the one-million-word CLE Urdu Digest corpus
(available at http://www.cle.org.pk/clestore/urdudigestcorpus1M.htm)
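As an illustration of criterion 4 only, the inclusion test for foreign words can be sketched as follows. This is a minimal sketch, not the procedure actually used to build the wordlist: the function name include_foreign_word, the variable names, and the toy data are all hypothetical, and the real Urdu Lughat entries and corpus counts would have to be loaded from the resources named above.

```python
from collections import Counter

# Threshold taken from criterion 4: at least 20 occurrences in the corpus.
MIN_CORPUS_FREQ = 20


def include_foreign_word(word, lughat_entries, corpus_counts, min_freq=MIN_CORPUS_FREQ):
    """Return True if a foreign word satisfies criterion 4.

    lughat_entries: set of headwords listed in the dictionary (hypothetical input).
    corpus_counts:  Counter mapping each token to its corpus frequency (hypothetical input).
    """
    # A Counter returns 0 for unseen words, so no explicit membership check is needed.
    return word in lughat_entries or corpus_counts[word] >= min_freq


# Toy placeholder data, not real dictionary or corpus contents.
lughat_entries = {"school"}
corpus_counts = Counter(["computer"] * 25 + ["laptop"] * 3 + ["school"] * 2)

candidates = ["school", "computer", "laptop"]
selected = [w for w in candidates
            if include_foreign_word(w, lughat_entries, corpus_counts)]
print(selected)  # ['school', 'computer']
```

In this toy run, "school" is kept because it is in the dictionary set despite its low count, "computer" is kept on frequency alone, and "laptop" is dropped because it meets neither condition.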
This wordlist forms the basis of the Urdu WordNet 1.0 developed by the Center for Language Engineering,
KICS, UET, Lahore.
This work was developed under the project grant for Essential Urdu Linguistic Resources
(www.cle.org.pk/eulr) in collaboration with the University of Konstanz (http://www.uni-konstanz.de/), Germany,
and funded by the German Academic Exchange Service, DAAD (https://www.daad.org/), Germany. |
|