2017 INTERSPEECH INTERSPEECH 2017

CALYOU: A Comparable Spoken Algerian Corpus Harvested from YouTube

Abstract

This paper addresses the issue of comparability of comments extracted from Youtube. The comments concern spoken Algerian that could be either local Arabic, Modern Standard Arabic or French. This diversity of expression gives rise to a huge number of problems concerning the data processing. In this article, several methods of alignment will be proposed and tested. The method which permits to best align is Word2Vec-based approach that will be used iteratively. This recurrent call of Word2Vec allows us improve significantly the results of comparability. In fact, a dictionary-based approach leads to a Recall of 4, while our approach allows one to get a Recall of 33 at rank 1. Thanks to this approach, we built from Youtube CALYOU, a Comparable Corpus of the spoken Algerian.

🧭 Keyword Pioneer — youtube mining
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio