2018 INTERSPEECH INTERSPEECH 2018

Machine Learning Powered Data Platform for High-Quality Speech and NLP Workflows

Abstract

Machine learning (ML) models - like deep neural networks - require substantial amounts of training data. Also, the training dataset should be properly annotated to obtain satisfactory results. This paper describes a platform designed to create high-quality datasets. By using data workflows adapted for speech technologies and natural language processing systems, the user can collect and enrich speech and text data. Depending on the end goal, the data is passed through multiple processing steps based on human input and ML services. To guarantee data quality, the platform combines several mechanisms like language tests, real-time audits and user behavior into several ML models that act as quality gateways.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning
📈 Trend Setter — Foundation Models
🧭 Keyword Pioneer — data annotation
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Speech & Audio
🐣 Hot Topic Early Bird — data annotation