Center for Language Engineering








CLE is making these linguistic resources available without cost to support academic, non-commercial research. The processing fees charged will be used to maintain these resources. Please contact CLE directly about discounts (applicable only to selected public organizations in Pakistan) or commercial licensing options.

  CLE Pakistan District Names Speech Corpus - Pashto Speakers
CLE Catalog #: CLE16S001
Release Date: 12 July 2016
First Language of Speakers: Pashto
Duration: 70 minutes
Number of Utterances: 4456
Distribution: 1 DVD, Web Download
Processing Fee (Pakistan): 15000 PKR
Processing Fee (International): 250 USD
License: Yes
  This package is a collection of speech recordings of the district names of Pakistan, spoken by Pashto speakers. The corpus comprises 139 single-word vocabulary items. The data was recorded over a mobile channel at a sampling rate of 8 kHz with 16-bit resolution. The gender and district of origin of each speaker are also provided with the corpus. The speakers range in age from 18 to 50 years. The data was collected in outdoor and office environments. The corpus has been cleaned and verified by expert linguists. The data is annotated at the word level using CI SAMPA, which is mapped to the Urdu IPA symbols.
  Data Source
  The data was collected from students and employees of different universities and research institutes, largely from Swat, Quetta, Peshawar, Lower Dir, Pishin, Mardan, Karak, Bannu, Malakand and Dera Ismail Khan.
  A list of the vocabulary items covered in the corpus is available here. The package contains three folders, as follows:
  • male: This folder contains audio files from male speakers in WAV format.
  • female: This folder contains audio files from female speakers in WAV format.
  • info: This folder contains information about the corpus.
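
  The folder layout and recording format described above can be sanity-checked after download with a short script. The following is a minimal sketch, not part of the CLE package: it assumes the audio files carry a lowercase .wav extension and sit somewhere under the male and female folders, and the function name check_corpus is hypothetical. It uses only Python's standard wave and pathlib modules.

```python
import wave
from pathlib import Path

EXPECTED_RATE_HZ = 8000      # 8 kHz sampling rate, per the corpus description
EXPECTED_SAMPLE_WIDTH = 2    # 16 bits = 2 bytes per sample


def check_corpus(root):
    """Scan the male/ and female/ folders under `root`.

    Returns (file_count, total_seconds, mismatches), where `mismatches`
    lists the files whose sampling rate or sample width differs from the
    documented 8 kHz / 16-bit format.
    Assumption: audio files use a lowercase ".wav" extension.
    """
    root = Path(root)
    file_count = 0
    total_seconds = 0.0
    mismatches = []
    for folder in ("male", "female"):
        folder_path = root / folder
        if not folder_path.is_dir():
            continue  # skip a folder that is missing from this copy
        for path in sorted(folder_path.rglob("*.wav")):
            with wave.open(str(path), "rb") as wav:
                rate = wav.getframerate()
                width = wav.getsampwidth()
                frames = wav.getnframes()
            file_count += 1
            total_seconds += frames / rate
            if rate != EXPECTED_RATE_HZ or width != EXPECTED_SAMPLE_WIDTH:
                mismatches.append(path)
    return file_count, total_seconds, mismatches
```

  Calling check_corpus on the unpacked package directory returns the number of audio files found, their combined duration in seconds, and any files that deviate from the documented format. If each utterance is stored as a separate file (an assumption), the count should be close to 4456 and the total duration close to 70 minutes.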