Arsha Nagrani

40 papers · 2017–2025 · 9 top CS/AI conferences

Achievements

🐣 Hot Topic Early Bird · 🌈 Renaissance Researcher (5) · 🌉 Interdisciplinary Bridge · 🧭 Keyword Pioneer · 🌍 Conference Polyglot (9) · 🏃 Academic Marathon (8) · 🐝 Cross-Pollinator (10) · 🗺️ Taxonomy Completionist (69) · 👥 Mega-Team (43) · 🔬 Deep Specialist (17) · 🤝 Dynamic Duo (20) · 🧬 Topic Evolution · 🚀 Conference Pioneer · 🗃️ Keyword Collector (170) · 📈 Trend Setter · 💎 Century Club (40) · 🔥 Unstoppable (9) · ❓ The Questioner · ⚡ Prolific Year (7)

Conferences

CVPR (15) · ICCV (8) · INTERSPEECH (5) · ECCV (4) · NIPS (3) · ACL (2) · EMNLP (1) · IJCNLP (1) · WACV (1)

Top co-authors

Cordelia Schmid (20) · Andrew Zisserman (14) · Anurag Arnab (10) · Chen Sun (8) · Paul Hongsuck Seo (6) · Max Bain (5) · Weidi Xie (5) · Shyamal Buch (5) · Gül Varol (5) · Tengda Han (4)

Research topics

Keywords

multimodal learning (16) · video understanding (12) · video captioning (5) · contrastive learning (4) · temporal localization (4) · large language model (4) · visual question answering (3) · audio description (3) · vision-language model (3) · video question answering (3) · self-supervised learning (3) · automatic speech recognition (3) · speaker verification (3) · cross-modal learning (3) · video-language model (3) · dense video captioning (3) · efficient computing (2) · transformer architecture (2) · zero-shot learning (2) · semantic alignment (2)

Papers

Flexible Frame Selection for Efficient Video Reasoning CVPR 2025

MINERVA: Evaluating Complex Video Reasoning ICCV 2025

Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation ICCV 2025

Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks CVPR 2025

VIEWS: Entity-Aware News Video Captioning EMNLP 2024

Mixture of Nested Experts: Adaptive Processing of Visual Tokens NIPS 2024

Streaming Dense Video Captioning CVPR 2024

On Scaling Up a Multilingual Vision and Language Model CVPR 2024

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering CVPR 2024

AutoAD III: The Prequel - Back to the Pixels CVPR 2024

VicTR: Video-conditioned Text Representations for Activity Recognition CVPR 2024

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning CVPR 2023

AutoAD: Movie Description in Context CVPR 2023

UnLoc: A Unified Framework for Video Localization Tasks ICCV 2023

LanSER: Language-Model Supported Speech Emotion Recognition INTERSPEECH 2023

AutoAD II: The Sequel - Who, When, and What in Movie Audio Description ICCV 2023

VidChapters-7M: Video Chapters at Scale NIPS 2023

AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR CVPR 2023

Modular Visual Question Answering via Code Generation ACL 2023

Verbs in Action: Improving Verb Understanding in Video-Language Models ICCV 2023

AVATAR: Unconstrained Audiovisual Speech Recognition INTERSPEECH 2022

Masking Modalities for Cross-Modal Video Retrieval WACV 2022

Learning Audio-Video Modalities from Image Captions ECCV 2022

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency ECCV 2022

End-to-End Generative Pretraining for Multimodal Video Captioning CVPR 2022

Attention Bottlenecks for Multimodal Fusion NIPS 2021

Recognizing Multimodal Entailment ACL 2021

Localizing Visual Sounds the Hard Way CVPR 2021

Look Before You Speak: Visually Contextualized Utterances CVPR 2021

Composable Augmentation Encoding for Video Representation Learning ICCV 2021

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval ICCV 2021

Recognizing Multimodal Entailment IJCNLP 2021

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos ECCV 2020

Spot the Conversation: Speaker Diarisation in the Wild INTERSPEECH 2020

Speech2Action: Cross-Modal Supervision for Action Recognition CVPR 2020

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition ICCV 2019

VoxCeleb2: Deep Speaker Recognition INTERSPEECH 2018

Learnable PINs: Cross-Modal Embeddings for Person Identity ECCV 2018

Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching CVPR 2018

VoxCeleb: A Large-Scale Speaker Identification Dataset INTERSPEECH 2017