Object Recognition as Next Token Prediction

Kaiyu Yue; Bor-Chun Chen; Jonas Geiping; Hengduo Li; Tom Goldstein; Ser-Nam Lim

CVPR 2024

Abstract

We present an approach that poses object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts text tokens from image embeddings to form labels. To ground this prediction process in auto-regression, we customize a non-causal attention mask for the decoder with two key features: tokens from different labels are modeled as independent, and image tokens are treated as a prefix. This masking mechanism inspires an efficient method, one-shot sampling, which samples the tokens of multiple labels in parallel and ranks the generated labels by their probabilities during inference. To further improve efficiency, we propose a simple strategy for constructing a compact decoder: discarding the intermediate blocks of a pretrained language model. This approach yields a decoder that matches the full model's performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp.
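The masking rule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal pure-Python illustration that assumes the sequence is laid out as image tokens followed by the tokens of each label, that image tokens attend to one another bidirectionally, and that each label token attends to the image prefix plus earlier tokens of its own label only. The function name `build_mask` and its parameters are hypothetical.

```python
def build_mask(num_image_tokens, label_lengths):
    """Return a boolean matrix where mask[q][k] is True if query q may attend to key k.

    Assumed layout (an illustration, not taken from the paper's code):
    [image tokens][label 1 tokens][label 2 tokens]...
    """
    total = num_image_tokens + sum(label_lengths)
    mask = [[False] * total for _ in range(total)]

    # Image tokens form a prefix: they attend to each other in both directions.
    for q in range(num_image_tokens):
        for k in range(num_image_tokens):
            mask[q][k] = True

    # Each label is an independent causal segment conditioned on the prefix.
    start = num_image_tokens
    for length in label_lengths:
        for q in range(start, start + length):
            # Every label token sees the whole image prefix.
            for k in range(num_image_tokens):
                mask[q][k] = True
            # Causal attention within the same label only.
            for k in range(start, q + 1):
                mask[q][k] = True
        start += length
    return mask


if __name__ == "__main__":
    # Two image tokens, then one label of two tokens and one label of one token.
    for row in build_mask(2, [2, 1]):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

Because no label token can see the tokens of another label, every label in the sequence is decoded as if it were alone after the image prefix. That independence is what would allow several labels to be sampled in one pass and ranked afterwards by probability.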

Authors

Kaiyu Yue, Bor-Chun Chen, Jonas Geiping, Hengduo Li, Tom Goldstein, Ser-Nam Lim

Topics

Machine Learning > Core Methods > Classification
Deep Learning > Architectures > Transformers
Computer Vision > Analysis > Object Detection
Machine Learning > Learning Types > Representation Learning
Artificial Intelligence > Core AI > Large Language Models
Artificial Intelligence > Core AI > Computer Vision
Deep Learning > Models > Transformers
Computer Vision > Core AI > Computer Vision
Deep Learning > Learning Types > Representation Learning
Computer Vision > Analysis > Object Classification

Keywords

image classification; object recognition; text generation; auto-regressive model; attention mask; next token prediction; image embedding; vision language; language decoder; language model; decoder
