CLE makes these linguistic resources available at no cost to support academic, non-commercial research. Any processing fees charged are used to maintain the resources. Please contact CLE directly about discounts (available only to selected public organizations in Pakistan) or commercial licensing options.

CLE Urdu Broadcast Speaker Identification Corpus

CLE Catalog #: CLE23S011
Release Date: 26 September 2023
First Language of Speakers: Urdu
Duration: 276 Hours
Distribution: Free Download
Processing Fee (Pakistan): 0 PKR
Processing Fee (International): 0 USD
License: Yes

Introduction

This package is a compilation of speaker identification data sourced from major Urdu broadcast news channels in Pakistan, primarily from their YouTube channels. The data is drawn from a diverse range of programs, including talk shows, interviews, press conferences, and addresses from national assemblies. It covers 1,184 speakers with a total audio duration of 276 hours.

Data Source

Data is collected from the YouTube channels of Geo News, ARY News, Samaa TV, Dunya News, BOL Network, PTV News, Aaj News, Express News, 92 News, Hum News, GNN, Dawn Media Group, Abb Takk News, and 24 Digital.

Dataset

The dataset package comprises three CSV files, "speaker_mapping", "Channel_tags", and "UBCSpk_detail_sheet", along with a folder labeled "Tdfs", which together describe the dataset.
- Speaker Mapping (speaker_mapping.csv): Contains each speaker's name and unique identification number, along with a count of how many times the speaker appears across the selected audio sources.
- Channel Tags (Channel_tags.csv): Specifies the channels from which the speaker data was sourced. Each entry includes the channel name and its associated tags.
- UBCSpk Detail Sheet (UBCSpk_detail_sheet.csv): Gives an overview of the dataset's audio and video content. It lists each file name alongside the corresponding YouTube URL and the duration of each video.
- Tdfs Folder: Holds a collection of TDF files. These files contain metadata about the segments, from which the segments can be generated.
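
As a first step after downloading, it may help to check what the metadata files actually contain. The sketch below is only an illustration: the exact column names are not documented on this page, so the code reports whatever header row it finds instead of assuming one. The helper names `summarize_csv` and `summarize_package` are hypothetical, and the code assumes the three CSV files and the "Tdfs" folder sit directly in the extracted package folder, that each CSV has a header row, and that the files are UTF-8 encoded.

```python
import csv
from pathlib import Path

# File names as listed in the package description above.
METADATA_FILES = [
    "speaker_mapping.csv",
    "Channel_tags.csv",
    "UBCSpk_detail_sheet.csv",
]


def summarize_csv(path):
    """Return (column_names, data_row_count) for a CSV file.

    Assumption: the first row is a header. "utf-8-sig" reads plain UTF-8
    and also drops a byte-order mark if one is present.
    """
    with open(path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no rows
        row_count = sum(1 for _ in reader)
    return header, row_count


def summarize_package(root):
    """Summarize the metadata CSVs found under root and count files in Tdfs/.

    Assumption: the CSVs and the Tdfs folder are direct children of root.
    Missing files are skipped instead of raising an error.
    """
    root = Path(root)
    summary = {}
    for name in METADATA_FILES:
        path = root / name
        if path.is_file():
            summary[name] = summarize_csv(path)
    tdf_dir = root / "Tdfs"
    tdf_count = 0
    if tdf_dir.is_dir():
        tdf_count = sum(1 for entry in tdf_dir.iterdir() if entry.is_file())
    return summary, tdf_count
```

Calling `summarize_package("path/to/extracted/package")` returns a dictionary mapping each CSV found to its header and row count, plus the number of files in "Tdfs". Inspecting the headers this way, before writing any parsing logic, avoids hard-coding column names that may differ from the actual release.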

Download

CLE Broadcast Speech SID