Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks

Masood S. Mortazavi

INTERSPEECH 2020

Abstract

Semantically aligned (speech; image) datasets can be used to explore “visually-grounded speech”. In most existing investigations, features of the image signal are extracted using neural networks “pre-trained” on other tasks (e.g., classification on ImageNet). In still others, pre-trained networks are used to extract audio features prior to semantic embedding. Without “transfer learning” through pre-trained initialization or pre-trained feature extraction, previous results have tended to show low recall rates in speech → image and image → speech queries. By choosing appropriate neural architectures for the encoders in the speech and image branches and by using large datasets, one can obtain competitive recall rates without relying on any pre-trained initialization or feature extraction. (Speech; image) semantic alignment and speech → image and image → speech retrieval are therefore canonical tasks worthy of independent investigation, and they allow one to explore other questions; e.g., the size of the audio embedder can be reduced significantly with little loss of recall in speech → image and image → speech queries.
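
The recall rates mentioned above can be illustrated with a minimal sketch. It assumes that row i of each embedding matrix encodes the i-th (speech; image) pair, that similarity is measured by cosine similarity, and that a query counts as a hit when its true match ranks among the top k candidates. The function name `recall_at_k`, the choice of cosine similarity, and the random data in the usage lines are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def recall_at_k(speech_emb, image_emb, k=10):
    """Recall@k in both retrieval directions.

    Assumes row i of speech_emb and row i of image_emb embed the same pair.
    """
    # L2-normalize so that the dot product equals cosine similarity
    # (an assumed similarity measure, chosen for illustration).
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sim = s @ v.T  # sim[i, j]: similarity of speech i and image j

    def _recall(scores):
        # Rank of the true match = number of candidates scoring strictly higher.
        true_scores = np.diag(scores)[:, None]
        ranks = (scores > true_scores).sum(axis=1)
        return float((ranks < k).mean())

    return {
        "speech_to_image": _recall(sim),    # each speech clip queries all images
        "image_to_speech": _recall(sim.T),  # each image queries all speech clips
    }


if __name__ == "__main__":
    # Hypothetical data: 100 pairs of 64-dimensional embeddings, where the
    # image embedding is a noisy copy of the speech embedding.
    rng = np.random.default_rng(0)
    speech = rng.normal(size=(100, 64))
    image = speech + 0.5 * rng.normal(size=(100, 64))
    print(recall_at_k(speech, image, k=10))
```

Evaluating both directions from a single similarity matrix (and its transpose) keeps the two query types directly comparable, which is convenient when studying how recall changes as an encoder is made smaller.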

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Learning Paradigms > Transfer Learning
Machine Learning > Learning Types > Representation Learning
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Learning Types > Representation Learning

Keywords

representation learning, transfer learning, multimodal learning, visual grounding, cross-modal retrieval, semantic alignment, semantic embedding, neural encoder, speech-image alignment
