Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval

Shubhashis Roy Dipta; Francis Ferraro

AACL 2025

Abstract

Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how to improve the retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval that is adaptable across datasets, domains, LLMs, and VLMs. Our approach demonstrates that decomposing a query with the knowledge embedded in LLMs and VLMs enhances the understanding of otherwise overly simplified human queries. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Q2E outperforms the previous SOTA on the MultiVENT dataset by 8 NDCG points and improves on MSR-VTT and MSVD by 4 and 3 points, respectively, outperforming several existing retrieval methods, including many fine-tuned and SOTA zero-shot approaches. We have released both code and data.
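
The abstract names entropy-based fusion scoring but does not give its formula, so the following is only a minimal sketch of one common way such a fusion can work, not the authors' implementation. It assumes each source (for example, a decomposed sub-query or a modality) yields one similarity score per candidate video, and that a source whose softmax-normalized scores have low entropy is more confident and should receive a larger weight. The function names `softmax`, `entropy`, and `entropy_weighted_fusion` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw similarity scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_fusion(score_lists):
    """Fuse per-source score lists, weighting confident sources higher.

    Assumption (not from the paper): weight = 1 - normalized entropy,
    so a peaked score distribution counts more than a flat one.
    """
    n = len(score_lists[0])
    if n < 2 or any(len(s) != n for s in score_lists):
        raise ValueError("need equal-length lists with at least 2 candidates")
    max_entropy = math.log(n)  # entropy of the uniform distribution
    weights = []
    for scores in score_lists:
        confidence = 1.0 - entropy(softmax(scores)) / max_entropy
        weights.append(max(0.0, confidence))  # clamp float round-off
    total = sum(weights)
    if total == 0.0:
        # Every source is uninformative: fall back to a plain average.
        weights = [1.0 / len(score_lists)] * len(score_lists)
    else:
        weights = [w / total for w in weights]
    fused = [
        sum(w * scores[i] for w, scores in zip(weights, score_lists))
        for i in range(n)
    ]
    return fused, weights


# Hypothetical scores for three candidate videos from two sources.
peaked = [5.0, 0.0, 0.0]  # confident source
flat = [1.0, 1.0, 1.0]    # uninformative source
fused_scores, source_weights = entropy_weighted_fusion([peaked, flat])
```

In this sketch the flat source contributes almost nothing to the fused ranking, which is the intuition behind weighting by entropy: no training is needed to decide how much each source should count.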

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Natural Language Processing > Applications > Information Retrieval
Natural Language Processing > Resources & Methods > Multilingual NLP
Artificial Intelligence > Learning Paradigms > Zero-Shot Learning

Keywords

zero-shot learning, multilingual retrieval, multimodal learning, multilingual text, vision-language model, multimodal fusion, event extraction, cross-lingual retrieval, zero-shot retrieval, query decomposition, text-to-video retrieval, large language model, event decomposition, entropy-based fusion, query-to-event decomposition
