INTERSPEECH 2024

YOLOPitch: A Time-Frequency Dual-Branch YOLO Model for Pitch Estimation

Abstract

Pitch estimation is of fundamental importance in audio processing and music information retrieval. YOLO is a well-developed model designed for object detection in images. Here we introduce YOLOv7 into the pitch estimation task and improve it by adding a time-frequency (TF) dual-branch structure to the model, motivated by human auditory pitch perception. An additional advantage of the model over state-of-the-art (SOTA) models is that it needs only an added unvoiced class, with no separate voiced/unvoiced detection, to achieve joint pitch estimation and voicing determination. Experiments show that, for both music and speech, the proposed TF dual-branch improves pitch estimation accuracy over the backbone. Our model exhibits superior pitch estimation performance over the SOTA and shows minimal performance degradation in noisy conditions. The overall accuracy on the MDB-stem-synth dataset peaks at 99.4%, and the voicing determination F-score reaches 99.9%.
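The idea of handling voicing through one extra class can be illustrated with a minimal sketch. It is not the YOLOPitch implementation: the bin count, bin spacing, lowest frequency, and the function names `class_to_hz` and `decode_frame` are all hypothetical choices made for illustration. The sketch only shows how a single argmax over per-frame class scores could yield both the voicing decision and the pitch value.

```python
# Hypothetical class layout (assumed for illustration, not taken from the paper):
# class 0 means "unvoiced"; classes 1..N_PITCH_BINS are pitch bins spaced
# CENTS_PER_BIN apart, starting at F_MIN_HZ.
UNVOICED = 0
N_PITCH_BINS = 360
CENTS_PER_BIN = 20.0
F_MIN_HZ = 32.70


def class_to_hz(class_index):
    """Map a class index to a frequency in Hz, or None for the unvoiced class."""
    if class_index == UNVOICED:
        return None
    cents = (class_index - 1) * CENTS_PER_BIN
    return F_MIN_HZ * 2.0 ** (cents / 1200.0)


def decode_frame(scores):
    """Pick the highest-scoring class for one frame.

    Returns (voiced, f0_hz); f0_hz is None when the unvoiced class wins.
    """
    best = max(range(len(scores)), key=scores.__getitem__)
    f0_hz = class_to_hz(best)
    return f0_hz is not None, f0_hz


# Usage: one frame where the unvoiced class wins, one where a pitch bin wins.
silent_frame = [0.9] + [0.0] * N_PITCH_BINS
pitched_frame = [0.1] + [0.0] * N_PITCH_BINS
pitched_frame[61] = 0.8  # bin 61 lies 1200 cents (one octave) above F_MIN_HZ
print(decode_frame(silent_frame))
print(decode_frame(pitched_frame))
```

Because the unvoiced label competes with the pitch bins in the same argmax, no separate voicing threshold or detector is needed in this sketch.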
