Low-Latency Online Streaming VideoQA Using Audio-Visual Transformers

Chiori Hori; Takaaki Hori; Jonathan Le Roux

INTERSPEECH 2022

Abstract

To apply scene-aware interaction technology to real-time dialog systems, we propose an online, low-latency response generation framework based on a video question answering (VideoQA) setup. This paper extends our prior work on low-latency video captioning with a novel approach that optimizes when each answer is generated, under a trade-off between generation latency and answer quality. For VideoQA, the timing detector is responsible for finding the timing of the question-relevant event, rather than determining when the system has seen enough to generate a general caption, as in the video captioning case. We extended our audio-visual scene-aware dialog system, built for the 10th Dialog System Technology Challenge, with this low-latency function. Experiments on the MSRVTT-QA and AVSD datasets show that our approach achieves between 97% and 99% of the answer quality of the upper bound given by a pre-trained Transformer that uses the entire video clips, while using less than 40% of the frames from the beginning of each clip.
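
The trade-off between latency and answer quality described above can be illustrated with a minimal sketch. It assumes, hypothetically, that a timing detector produces one confidence score per frame and that the answer is generated at the first frame whose score reaches a fixed threshold. The function name `choose_emission_frame`, the example scores, and the threshold values are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence, Tuple


def choose_emission_frame(scores: Sequence[float], threshold: float) -> Tuple[int, float]:
    """Pick the frame at which to generate an answer (hypothetical helper).

    scores:    per-frame detector confidences, assumed to lie in [0, 1].
    threshold: confidence required before an answer is emitted.

    Returns (frame_index, fraction_of_frames_consumed). If no score
    reaches the threshold, the whole clip is used.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    for i, score in enumerate(scores):
        if score >= threshold:
            # Emit as soon as the detector is confident enough.
            return i, (i + 1) / len(scores)
    # Fallback: the detector never fired, so wait for the last frame.
    return len(scores) - 1, 1.0


# Illustrative scores for a 10-frame clip (made-up values).
example_scores = [0.1, 0.2, 0.35, 0.7, 0.9, 0.95, 0.97, 0.98, 0.99, 0.99]
frame, used = choose_emission_frame(example_scores, threshold=0.6)
print(frame, used)
```

In this sketch, raising the threshold delays the answer and consumes more of the clip, while lowering it reduces latency at the risk of answering before the question-relevant event has been observed.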

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Architectures > Transformers
Computer Vision > Processing > Video Understanding
Natural Language Processing > Applications > Question Answering
Machine Learning > Learning Types > Deep Learning
Computer Vision > Analysis > Video Understanding

Keywords

video question answering; dialogue system; low-latency inference; audio-visual processing; scene-aware interaction
