CLE Store

Center for Language Engineering


CLE makes these linguistic resources available free of charge to support academic, non-commercial research. Any processing fees charged are used to maintain these resources. Please contact CLE directly about discounts (applicable only to select public organizations in Pakistan) or commercial licensing options.

CLE Urdu Broadcast Speaker Identification Corpus

CLE Catalog #:	CLE23S011
Release Date:	26 September 2023
First Language of Speakers:	Urdu
Duration:	276 Hours
Distribution:	Free Download
Processing Fee (Pakistan):	0 PKR
Processing Fee (International):	0 USD
License:	Yes

Introduction

This package is a compilation of speaker identification data sourced from major Urdu broadcast news channels in Pakistan, primarily from their YouTube channels. The data is drawn from a diverse range of programs, including talk shows, interviews, press conferences, and addresses from national assemblies. It covers 1184 speakers, with a total audio duration of 276 hours.

Data Source

Data was collected from the YouTube channels of Geo News, ARY News, Samaa TV, Dunya News, BOL Network, PTV News, Aaj News, Express News, 92 News, Hum News, GNN, Dawn Media Group, Abb Takk News, and 24 Digital.

Dataset

The dataset package comprises three CSV files named "speaker_mapping," "Channel_tags," and "UBCSpk_detail_sheet," along with a folder labeled "Tdfs." Together they describe the dataset.

Speaker Mapping (speaker_mapping.csv): Contains information about individual speakers, including their names and unique identification numbers. It also gives a count of how many times each speaker appears across the selected audio sources.
Channel Tags (Channel_tags.csv): Specifies the channels from which the speaker data was sourced. Each entry includes the name of the channel and the tags associated with it.
UBCSpk Detail Sheet (UBCSpk_detail_sheet.csv): Gives an overview of the dataset's audio and video content. It lists the file names alongside the corresponding YouTube URLs, and provides the duration of each video.
Tdfs Folder: Holds a collection of TDF files. These files contain metadata about the segments, which makes it possible to generate the segments.
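
As a first check after downloading, the package layout described above can be inspected with a short script. The following is a minimal sketch in Python, not an official CLE tool. The function name `summarize_package`, the UTF-8 encoding, and the assumption that the three CSV files and the "Tdfs" folder sit directly under the extracted package root are all illustrative guesses. The column layout of the CSV files and the internal format of the TDF files are not documented on this page, so the script only reports what it finds rather than assuming any column names.

```python
import csv
from pathlib import Path

# File and folder names as listed in the dataset description. Their exact
# location and spelling on disk are assumptions to verify after download.
CSV_FILES = ["speaker_mapping.csv", "Channel_tags.csv", "UBCSpk_detail_sheet.csv"]
TDF_DIR = "Tdfs"


def summarize_package(root):
    """Return header columns and row counts per CSV, plus the TDF file count.

    A value of None means the file or folder was not found under `root`.
    """
    root = Path(root)
    summary = {}
    for name in CSV_FILES:
        path = root / name
        if not path.is_file():
            summary[name] = None  # missing, or named differently than assumed
            continue
        # "utf-8-sig" is an assumed encoding; it also tolerates a leading BOM.
        with path.open(newline="", encoding="utf-8-sig") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])          # first row taken as the header
            row_count = sum(1 for _ in reader)  # remaining rows are data rows
        summary[name] = {"columns": header, "rows": row_count}

    tdf_dir = root / TDF_DIR
    if tdf_dir.is_dir():
        # Count regular files only; the TDF contents are not parsed here.
        summary[TDF_DIR] = sum(1 for p in tdf_dir.iterdir() if p.is_file())
    else:
        summary[TDF_DIR] = None
    return summary
```

Calling `summarize_package("path/to/extracted/package")` (the path is a placeholder) returns a dictionary keyed by file name. Printing the reported columns is a reasonable way to learn the actual headers before writing any code that depends on them.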

Download

CLE Broadcast Speech SID

webmaster@cle.org.pk