Center for Language Engineering

Urdu Ligatures from Corpus

Release Notes

The wordlist was extracted from a 19.3-million-word corpus gathered from a wide range of domains,
listed in the following table and chosen with the end user's perspective in mind.

Domains                        Sub domains

C1. Sports/Games               C1.1. Sports (special events)
C2. News                       C2.1. Local and international affairs
                               C2.2. Editorials and opinions
C3. Finance                    C3.1. Business, domestic and foreign market
C4. Culture/Entertainment      C4.1. Music, theatre, exhibitions, review articles on literature
                               C4.2. Travel / tourism
C5. Consumer Information       C5.1. Health
                               C5.2. Popular science
                               C5.3. Consumer technology
C6. Personal communications    C6.1. Emails, online, discussions, editorials, e-zines

The domain-wise distribution of corpus size is given in the following table.

                               Raw Corpora
Domains                        Size        Distinct words

C1. Sports/Games               1666304     23118
C2. News                       8957259     67365
C3. Finance                    1162019     17024
C4. Culture/Entertainment      3845117     59214
C5. Consumer Information       1980723     34151
C6. Personal communications    1685424     30469

Total                          19296846    104341
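
As a quick sanity check on the figures above, the per-domain sizes can be summed and compared with the reported total. The sketch below simply hard-codes the numbers from the table; the variable names are illustrative and not part of the release. The distinct-word counts are not expected to add up to the reported total, presumably because the same word can occur in more than one domain.

```python
# Per-domain figures copied from the table: (size, distinct words).
domains = {
    "C1. Sports/Games": (1666304, 23118),
    "C2. News": (8957259, 67365),
    "C3. Finance": (1162019, 17024),
    "C4. Culture/Entertainment": (3845117, 59214),
    "C5. Consumer Information": (1980723, 34151),
    "C6. Personal communications": (1685424, 30469),
}

# Totals as reported in the last row of the table.
reported_total_size = 19296846
reported_total_distinct = 104341

size_sum = sum(size for size, _ in domains.values())
distinct_sum = sum(distinct for _, distinct in domains.values())

# Sizes are additive across domains, so this should match the reported total.
print("size sum matches total:", size_sum == reported_total_size)

# Distinct-word counts overlap between domains, so their sum
# is larger than the corpus-wide distinct count.
print("sum of per-domain distinct words:", distinct_sum)
print("reported distinct words:", reported_total_distinct)
```

The sizes sum exactly to 19296846, while the per-domain distinct counts sum to 231341, more than twice the corpus-wide figure of 104341.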
 
       
The list has been cleaned of non-Urdu characters but has not been validated for other corpus issues, e.g. spelling mistakes or text from other languages quoted in Urdu text, such as Arabic or Punjabi.
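
The release notes do not describe how the cleaning was done. The following is only a minimal sketch of one possible character-level filter, under the assumption that "Urdu characters" means code points in the Unicode Arabic block (U+0600 to U+06FF). The function names `is_urdu_script` and `clean_wordlist` are hypothetical. It also illustrates why quoted Arabic or Punjabi text would survive such a filter: it is written in the same script.

```python
def is_urdu_script(token: str) -> bool:
    """Return True if every character lies in the Unicode Arabic block.

    Assumption: U+0600..U+06FF is used as a stand-in for "Urdu characters".
    """
    return bool(token) and all("\u0600" <= ch <= "\u06FF" for ch in token)


def clean_wordlist(tokens):
    """Keep only tokens made up entirely of Arabic-block characters."""
    return [t for t in tokens if is_urdu_script(t)]


# Small illustrative sample: two Urdu words, one Latin token,
# and one token mixing Urdu letters with an ASCII digit.
sample = ["اردو", "Urdu", "کتاب2", "سلام"]
cleaned = clean_wordlist(sample)
print(cleaned)
```

A filter of this kind works on characters only, so it cannot detect misspellings or words from other languages that share the script.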
     
Download: Urdu Ligatures from Corpus (accesses counted since 20 January 2011)

License
     
 

webmaster@cle.org.pk