YOLOPitch: A Time-Frequency Dual-Branch YOLO Model for Pitch Estimation

Xuefei Li; Hao Huang; Ying Hu; Liang He; Jiabao Zhang; Yuyi Wang

2024 INTERSPEECH INTERSPEECH 2024

YOLOPitch: A Time-Frequency Dual-Branch YOLO Model for Pitch Estimation

Abstract

Pitch estimation is of fundamental importance in audio processing and music information retrieval. YOLO is a well developed model designed for image target detection. Here we introduce YOLOv7 into pitch estimation task and improve by proposing time-frequency (TF) dual-branch into the model according to pitch perception of human auditory. An additional advantage of the model over the state-of-the-art (SOTA) models is that it only needs to add an unvoiced class without additional unvoiced/voiced detection to achieve joint pitch estimation and voiced determination. Experiments show for both music and speech, the proposed TF dual-branch can boost pitch estimation accuracy over the back-bone. Our model exhibits superior pitch estimation performance over the SOTA and shows minimal performance degradation in noisy condition. The overall accuracy on the MDB-stem-synth dataset peaks at 99.4%, and voicing determination F-score reaches 99.9%.

🧭 Keyword Pioneer — voiced determination

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning and Machine Learning and Speech & Audio

Authors

Xuefei Li , Hao Huang , Ying Hu , Liang He , Jiabao Zhang , Yuyi Wang

Topics

Machine Learning > Core Methods > Classification Machine Learning > Core Methods > Representation Learning Machine Learning > Learning Types > Self-Supervised Learning Deep Learning > Architectures > Neural Networks Computer Vision > Analysis > Object Detection Speech & Audio > Analysis > Speech Analysis

Keywords

deep learning time-frequency analysis audio processing music information retrieval pitch estimation voiced determination neural network

Download PDF

Related papers

Reshape Dimensions Network for Speaker Recognition 2024

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification 2024

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch 2024

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions 2024

K-means and hierarchical clustering of f0 contours 2024