Seeing isn’t Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms

Tyler Loakman; Joseph James; Chenghua Lin

2025 AACL AACL 2025

Seeing isn’t Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms

Abstract

AbstractWith the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in different tasks that fuse both vision and language modalities. In this work, we benchmark the extent to which VLMs are able to act as highly-trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task whereby models must predict the correct phonemic or graphemic transcription of a spoken word when presented amongst 3 distractor transcriptions that have been selected based on their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models rarely perform above chance, demonstrating the difficulty of this task stemming from the requirement for esoteric parametric knowledge of how to interpret such figures, rather than paired samples alone.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Natural Language Processing

🧭 Keyword Pioneer — spectrogram interpretation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Tyler Loakman , Joseph James , Chenghua Lin

Topics

Artificial Intelligence > Core AI > Foundation Models Artificial Intelligence > Core AI > Multimodal Learning Artificial Intelligence > Learning Paradigms > Few-Shot Learning Natural Language Processing > Resources & Methods > Large Language Models

Keywords

zero-shot learning benchmark evaluation speech recognition phonetic analysis phoneme recognition vision language model phonetic transcription spectrogram interpretation multimodal benchmarking

Download PDF

Related papers

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge 2025

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems 2025

Enhancing Training Data Quality through Influence Scores for Generalizable Classification: A Case Study on Sexism Detection 2025

CtrlShift: Steering Language Models for Dense Quotation Retrieval with Dynamic Prompts 2025

A Diagnostic Framework for Auditing Reference-Free Vision-Language Metrics 2025