CLE makes these linguistic resources available at no cost to support academic, non-commercial research. The processing fees charged are used to maintain these resources. Please contact CLE directly about discounts (available only to selected public organizations in Pakistan) or commercial licensing options.

CLE Urdu Text Corpus 20 Point Size

Source:
CLE Catalog #: CLE14T008
Release Date: 12 July 2012
Data Type: Text
Language(s): Urdu, English
Distribution: 1 DVD, Web Download
Processing Fee (Pakistan): 30000 PKR
Processing Fee (International): 250 USD
License: Yes

Introduction

CLE Urdu Text Corpus 20 Point Size is the typed version of CLE Urdu Image Corpus 20 Point Size. All one hundred and forty nine (149) pages were typed by two typists, and the two typed versions were cross-compared to finalize the text corpus. The corpus covers the Urdu character set, Urdu digits, Latin digits, English characters, Urdu aerab (diacritics) and Urdu special symbols. The pages were selected from different domains (where available), such as literature, poetry, religion, biography, novels, interviews, culture/travel, history, autobiography, science, short stories and character representation.
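
The character coverage described above can be checked on any page of the corpus with a short script. The following is only a minimal sketch: the class names, the `classify` and `coverage` helpers, and the choice of Unicode ranges (the Arabic block U+0600 to U+06FF, with U+06F0 to U+06F9 treated as Urdu digits) are assumptions made for illustration, not part of the corpus documentation.

```python
import unicodedata
from collections import Counter


def classify(ch: str) -> str:
    """Assign one character to a coarse class (illustrative classes only)."""
    if "\u06F0" <= ch <= "\u06F9":
        return "urdu_digit"          # Extended Arabic-Indic digits
    if "0" <= ch <= "9":
        return "latin_digit"
    if ch.isascii() and ch.isalpha():
        return "english_letter"
    if "\u0600" <= ch <= "\u06FF":   # assumption: only the basic Arabic block
        category = unicodedata.category(ch)
        if category == "Mn":
            return "aerab"           # combining marks such as zabar, zer, pesh
        if category == "Lo":
            return "urdu_letter"
        return "urdu_symbol"         # punctuation and other signs in the block
    if ch.isspace():
        return "whitespace"
    return "other"


def coverage(text: str) -> Counter:
    """Count how many characters of each class occur in the text."""
    return Counter(classify(ch) for ch in text)


# Small demonstration on a hand-made string (not taken from the corpus):
# alef + zabar + beh, two Urdu digits, two Latin digits, "ab", Arabic comma.
sample = "\u0627\u064E\u0628 \u06F1\u06F2 34 ab\u060C"
print(coverage(sample))
```

To apply this to a corpus page, read the file with `open(path, encoding="utf-8")` and pass its contents to `coverage`.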

Data Source

CLE Urdu Text Corpus 20 Point Size is the typed version of CLE Urdu Image Corpus 20 Point Size, which was collected from forty five (45) Urdu books.

Data

This corpus contains one hundred and forty nine (149) typed files in UTF-8 encoding. Each file is named as follows:
G(Grayscale)_UE(UnEdited)_B<Book ID>_<Document type>_P<page number>_F<font size>.txt
The document type of a page is R for normal text, I for image, or T for table of contents.
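
The naming scheme above lends itself to automatic parsing. The sketch below assumes that the parenthesized words (Grayscale, UnEdited) are glosses and do not appear in the actual file names, so that a name looks like `G_UE_B<Book ID>_<Document type>_P<page number>_F<font size>.txt`. It also assumes that page number and font size are written as plain digits. The `PageFile` type, the `parse_filename` helper and the example file name are hypothetical and should be adjusted to the files as distributed.

```python
import re
from typing import NamedTuple, Optional

# Assumed on-disk form: G_UE_B<Book ID>_<Document type>_P<page>_F<font size>.txt
FILENAME_RE = re.compile(
    r"^G_UE_B(?P<book_id>[^_]+)_(?P<doc_type>[RIT])"
    r"_P(?P<page>\d+)_F(?P<font_size>\d+)\.txt$"
)

# Document type codes as listed in the corpus description.
DOC_TYPES = {"R": "normal text", "I": "image", "T": "table of contents"}


class PageFile(NamedTuple):
    book_id: str      # kept as a string because the ID format is not specified
    doc_type: str     # expanded description of the R / I / T code
    page: int
    font_size: int


def parse_filename(name: str) -> Optional[PageFile]:
    """Return the parsed fields, or None if the name does not match the pattern."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return PageFile(
        book_id=match["book_id"],
        doc_type=DOC_TYPES[match["doc_type"]],
        page=int(match["page"]),
        font_size=int(match["font_size"]),
    )


# Hypothetical file name, used only to show the call.
print(parse_filename("G_UE_B12_R_P7_F20.txt"))
```

Returning `None` for non-matching names, rather than raising an error, makes it easy to skip unrelated files when walking a directory.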

Sample