2019 INTERSPEECH INTERSPEECH 2019

Multimodal Word Discovery and Retrieval with Phone Sequence and Image Concepts

Abstract

This paper demonstrates three different systems capable of performing the multimodal word discovery task. A multimodal word discovery system accepts, as input, a database of spoken descriptions of images (or a set of corresponding phone transcripts), and learns a lexicon which is a mapping from phone strings to their associated image concepts. Three systems are demonstrated: one based on a statistical machine translation (SMT) model, two based on neural machine translation (NMT). On Flickr8k, the SMT-based model performs much better than the NMT-based one, achieving a 49.6% F1 score. Finally, we apply our word discovery system to the task of image retrieval and achieve 29.1% recall@10 on the standard 1000-image Flickr8k tests set.

🌉 Interdisciplinary Bridge — Computer Vision and Natural Language Processing
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio