2022 INTERSPEECH INTERSPEECH 2022

MusicNet: Compact Convolutional Neural Network for Real-time Background Music Detection

Abstract

With the recent growth of remote work, online meetings often encounter challenging audio contexts such as background noise, music, and echo. Accurate real-time detection of music events can help to improve the user experience. In this paper, we present MusicNet, a compact neural model for detecting background music in the real-time communications pipeline. In video meetings, music frequently co-occurs with speech and background noises, making the accurate classification quite challenging. We propose a compact convolutional neural network core preceded by an in-model featurization layer. Music- Net takes 9 seconds of raw audio as input and does not require any model-specific featurization in the product stack. We train our model on the balanced subset of the Audio Set [1] data and validate it on 1000 crowd-sourced real test clips. Finally, we compare MusicNet performance with 20 state-of-the-art models. MusicNet has a true positive rate (TPR) of 81.3% at a 0.1% false-positive rate (FPR), which is significantly better than state-of-the-art models included in our study. MusicNet is also 10x smaller and has 4x faster inference than the best-performing models we benchmarked.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Speech & Audio
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio