Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP

Eunji Kim; Kyuhong Shim; Simyung Chang; Sungroh Yoon

EMNLP 2024

Abstract

A text encoder within Vision-Language Models (VLMs) like CLIP plays a crucial role in translating textual input into an embedding space shared with images, thereby facilitating the interpretative analysis of vision tasks through natural language. Despite the varying significance of different textual elements within a sentence depending on the context, efforts to account for variation of importance in constructing text embeddings have been lacking. We propose a framework of Semantic Token Reweighting to build Interpretable text embeddings (SToRI), which incorporates controllability as well. SToRI refines the text encoding process in CLIP by differentially weighting semantic elements based on contextual importance, enabling finer control over emphasis responsive to data-driven insights and user preferences. The efficacy of SToRI is demonstrated through comprehensive experiments on few-shot image classification and image retrieval tailored to user preferences.
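The abstract does not spell out how the reweighting is applied, so the following is only an illustrative sketch of one plausible way to weight tokens differently inside an attention step: each token's exponentiated attention score is scaled by a non-negative importance weight before normalization. The function names `reweighted_attention` and `pool`, and the choice of scaling inside the softmax, are assumptions made for illustration and should not be read as the authors' implementation.

```python
import math

def reweighted_attention(scores, weights):
    """Softmax over one row of attention scores, with each token's
    exponentiated score scaled by a non-negative importance weight.
    (Hypothetical helper; the weighting scheme is an assumption.)"""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    m = max(scores)  # subtract the max for numerical stability
    unnorm = [w * math.exp(s - m) for s, w in zip(scores, weights)]
    total = sum(unnorm)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return [u / total for u in unnorm]

def pool(values, attn):
    """Attention-weighted average of per-token value vectors."""
    dim = len(values[0])
    return [sum(a * v[d] for a, v in zip(attn, values)) for d in range(dim)]

# Three tokens with equal raw scores; the middle token is emphasized 2x.
attn = reweighted_attention([0.0, 0.0, 0.0], [1.0, 2.0, 1.0])
pooled = pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], attn)
```

With all weights set to 1 this reduces to an ordinary softmax, so in this sketch the weights act purely as a per-token emphasis knob that could either be learned from data or set by a user.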

Topics

Artificial Intelligence > Core AI > Interpretability
Machine Learning > Core Methods > Embedding Learning
Machine Learning > Learning Paradigms > Few-Shot Learning
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Models > Transformers
Deep Learning > Techniques > Representation Learning
Artificial Intelligence > Core AI > Multi-Modal Learning
Deep Learning > Models > Vision-Language Models

Keywords

few-shot learning, image retrieval, multimodal learning, vision-language model, text embedding, semantic token, interpretable embedding, few-shot image classification, semantic token reweighting
