ECCV 2024

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models