ADI17 dataset for Fine-grained Arabic Dialect Identification (ADI)
The ADI task is to identify which of 17 dialects is spoken in a speech segment from YouTube. Previous studies on Arabic dialect identification from audio signals were limited to 5 dialect classes due to the lack of speech corpora. To enable a fine-grained analysis of Arabic dialect speech, we collected Arabic dialect speech from YouTube.
For the Train set, about 3,000 hours of Arabic dialect speech from 17 countries in the Arab world was collected from YouTube. Because we assigned labels according to the country of each YouTube channel, the dataset may contain some labeling errors. For this reason, the ADI task has two sub-tracks: a supervised learning track and an unsupervised track. Participants may therefore choose whether or not to use the Train set labels.
For the Dev and Test sets, about 280 hours of speech data was collected from YouTube. After automatic speaker linking and dialect labeling by human annotators, we selected 57 hours of speech to use as the Dev and Test sets for performance evaluation. The test set is divided into three sub-categories by segment duration: short (under 5 sec), medium (between 5 sec and 20 sec), and long (over 20 sec).
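The duration sub-categories can be sketched as a small helper for grouping evaluation segments. This is only an illustration, not part of the dataset tooling: the function name `duration_category` is hypothetical, and the handling of segments lasting exactly 5 or 20 seconds (counted here as medium) is an assumption, since the description does not say which side those boundaries fall on.

```python
def duration_category(seconds: float) -> str:
    """Map a segment duration in seconds to 'short', 'medium', or 'long'.

    Assumption: segments of exactly 5 s or 20 s are treated as 'medium';
    the dataset description leaves these boundary cases unspecified.
    """
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    if seconds < 5:
        return "short"   # under 5 sec
    if seconds <= 20:
        return "medium"  # between 5 sec and 20 sec
    return "long"        # over 20 sec
```

For example, `duration_category(3.2)` falls in the short group and `duration_category(42.0)` in the long group, so per-duration results can be reported separately.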
You can find more details (labels, YouTube ids) about the ADI17 dataset here.
To request a passcode to access the dataset, please send an e-mail to:
swshon (at) csail (dot) mit (dot) edu
The ADI17 dataset is available to download for research purposes under
a Creative Commons Attribution-ShareAlike 4.0 International License.
The copyright remains with the original owners of the video.
A complete version of the license can be found here.