AAAI 2025

Scalable Vision-Language Understanding and Generation

Abstract

Recent advances in vision-language models have shown remarkable potential, yet creating scalable systems that can effectively understand and generate across modalities remains challenging. This talk will present our contributions to advancing scalable vision-language systems, focusing on three key themes: (1) efficient vision-language understanding, including our work on temporal perceiving video-language pre-training and knowledge-enhanced zero-shot retrieval; (2) scalable generation frameworks, encompassing our innovations in zero-shot captioning and co-speech gesture generation; and (3) practical applications and deployments of these technologies. We will discuss how these advances have enabled both better performance and improved efficiency in real-world scenarios, and explore future directions for scalable multimodal systems.

Authors