CLE makes these linguistic resources available at no cost to support academic, non-commercial research. The processing fees charged are used to maintain the resources. Please contact CLE directly about discounts (available only to select public organizations in Pakistan) or commercial licensing options.
|
CLE Urdu Digest POS Tagged Corpus 100K |
|
Source: | Urdu Digest |
CLE Catalog #: | CLE12T006 |
Release Date: | 22 June 2012 |
Data Type: | Text |
Language(s): | Urdu |
Distribution: | 1 DVD, Web Download |
Processing Fee (Pakistan): | 30000 PKR |
Processing Fee (International): | 250 USD |
License: | Yes |
|
|
|
|
Introduction |
|
CLE Urdu Digest POS Tagged Corpus is a 100,000-word collection of written Urdu drawn from a wide range of domains, designed for linguistic research and/or the development of language products. The corpus covers subjects including education, health, politics, international affairs, sports, business, humor and literature. It is divided into two major categories: Informational (80%) and Imaginative (20%). The Informational part includes texts from letters, interviews, press, religion, sports, culture, entertainment, health and science. The Imaginative part includes texts from short stories and novels, translations of foreign literature, and book reviews. |
|
|
|
Data Source |
|
The data for this corpus was taken from issues of Urdu Digest published between 2003 and 2011. Urdu Digest is a leading general-interest Urdu magazine with a fifty-two-year publication history. |
|
|
|
Data |
|
The data is distributed as 348 UTF-8 encoded files, arranged according to the genres listed above. Each file contains at least three hundred words. |
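
The file count and minimum length stated above can be checked after download with a short script. The following is only a minimal sketch under stated assumptions: the function name `check_corpus`, the `*.txt` file pattern, and the directory layout are hypothetical, since this page does not describe the file naming or the tagging format. It also assumes that tokens are separated by whitespace, so the counts are approximate if tags are stored as separate tokens.

```python
from pathlib import Path

EXPECTED_FILES = 348  # file count stated in the catalog description
MIN_WORDS = 300       # minimum words per file stated in the description


def check_corpus(root, pattern="*.txt"):
    """Count corpus files under `root` and list files shorter than MIN_WORDS.

    Assumptions (not confirmed by the catalog page): files match `pattern`,
    and tokens are whitespace-separated.
    """
    file_count = 0
    short_files = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        file_count += 1
        # "utf-8-sig" decodes UTF-8 with or without a byte-order mark and
        # raises UnicodeDecodeError if a file is not valid UTF-8.
        text = path.read_text(encoding="utf-8-sig")
        n_tokens = len(text.split())
        if n_tokens < MIN_WORDS:
            short_files.append((path.name, n_tokens))
    return file_count, short_files
```

A call such as `check_corpus("path/to/corpus")` (the path is a placeholder) returns the number of files found, which can be compared with `EXPECTED_FILES`, and a list of `(file name, token count)` pairs for any file that falls below the stated minimum.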
|
|
|
Sample |
|