Unsupervised Learning of Spoken Language with Visual Context

David Harwath; Antonio Torralba; James Glass

NIPS (NeurIPS) 2016

Abstract

Humans learn to speak before they can read or write, so why can’t computers do the same? In this paper, we present a deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images. We describe the collection of our data, which comprises over 120,000 spoken audio captions for the Places image dataset, and evaluate our model on an image search and annotation task. We also provide some visualizations which suggest that our model is learning to recognize meaningful words within the caption spectrograms.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Unsupervised Learning
Computer Vision > Generation > Image Captioning
Speech & Audio > Recognition > Speech Recognition
Deep Learning > Learning Types > Unsupervised Learning

Keywords

unsupervised learning, multimodal learning, visual context, image annotation, deep neural network, spoken language, spoken language acquisition, speech acquisition
