A part-of-speech tagging
system consists of two main phases:
tagset design and implementation of a
disambiguation technique. Urdu shares a large part of its
vocabulary with Arabic and Persian, and its
morphology and syntactic structure with Hindi. However,
there are standard tagging guidelines that
aim to standardize the tagsets of all languages of
the world. The training corpus was manually checked
to ensure that words are separated by spaces. The corpus was prepared by
applying normalization and by removing diacritics and
non-Urdu words. The tagger achieved an accuracy of 97.2% when
tested on data of 10,000 words.
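The corpus preparation steps described above (normalization, diacritic removal, dropping non-Urdu words) can be sketched roughly as follows. This is a minimal illustration and not the tagger's actual preprocessing code: the function name `prepare_line`, the choice of Unicode NFC as the normalization form, the set of code points treated as diacritics, and the rule that a token is "Urdu" when all of its characters fall in the Arabic Unicode block are all assumptions made for the example.

```python
import re
import unicodedata

# Assumption: the diacritics to strip are the Arabic short-vowel marks
# (U+064B..U+0652) plus the superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Assumption: a token counts as Urdu when every character lies in the
# Arabic Unicode block (U+0600..U+06FF).
URDU_TOKEN = re.compile("[\u0600-\u06FF]+")


def prepare_line(line):
    """Normalize one line, strip diacritics, and keep only Urdu tokens."""
    # Assumption: "normalization" is approximated here by Unicode NFC.
    text = unicodedata.normalize("NFC", line)
    # Remove the diacritic marks listed above.
    text = DIACRITICS.sub("", text)
    # Words are expected to be separated by spaces already.
    tokens = text.split()
    # Drop any token that contains characters outside the Arabic block.
    return [tok for tok in tokens if URDU_TOKEN.fullmatch(tok)]
```

Calling `prepare_line` on a line that mixes Urdu and Latin-script words returns only the Urdu tokens, with short-vowel marks removed. A real pipeline would likely need a more careful normalization table for Urdu-specific character variants.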
The Part of Speech
Tagger tags the given text using
the Urdu Part of Speech Tagset. The tagger takes its
input from the file "input.txt". A file "Tags.txt"
containing open class tags is used to supply candidate tags
for unknown words. The output is saved in a text file
named "results.txt".
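The role of "Tags.txt" can be illustrated with a small sketch of candidate tag selection. This is not the tagger's own code, and it leaves out the statistical disambiguation step entirely. The helper names `load_open_class_tags` and `candidate_tags`, the `lexicon` dictionary (word to set of tags seen in training), and the one-tag-per-line file layout are assumptions for illustration only.

```python
def load_open_class_tags(lines):
    """Parse open class tags from an iterable of lines.

    Assumption: the file holds one tag per line; blank lines are ignored.
    """
    return [ln.strip() for ln in lines if ln.strip()]


def candidate_tags(word, lexicon, open_class_tags):
    """Return the tags a disambiguator would choose among for `word`.

    Known words keep the tags observed for them in training (hypothetical
    `lexicon`); unknown words receive every open class tag as a candidate.
    """
    known = lexicon.get(word)
    if known:
        return sorted(known)
    return list(open_class_tags)
```

Under these assumptions, the tags could be loaded by passing an open file handle for "Tags.txt" (opened with UTF-8 encoding) to `load_open_class_tags`, and `candidate_tags` would then be called once per token read from "input.txt". Restricting unknown words to open class tags keeps the candidate set small, since closed classes rarely gain new members.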
The Statistical POS tagger requires
Microsoft .NET Framework
2.0.