MBTI: Metric-Based Textual Inversion for Fine-Grained Image Generation
Abstract
Diffusion-based data augmentation increases semantic diversity but often fails to preserve class-defining cues in fine-grained categories, which harms few-shot classification. We present Metric-Based Textual Inversion (MBTI), a textual-inversion training scheme that explicitly leverages inter-class relations while learning a pseudo-token per class from a few images. At each iteration, MBTI selects top-K support embeddings from previously learned classes and pushes the current embedding token away from these supports using a diffusion-based distance computed on denoiser noise predictions under matched latents and timesteps. This metric-aware objective enlarges inter-class margins while retaining class-specific attributes, yielding more representative synthetic samples. Across few-shot, fine-grained classification settings, MBTI consistently improves accuracy over textual inversion-based baselines and produces sharper, more discriminative details.
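To make the repulsion step concrete, the following is a minimal sketch under stated assumptions, not the implementation used in this work. It covers only the repulsion term, not the full training objective. The denoiser is replaced by a hypothetical toy function (`toy_denoiser`), the distance is assumed to be a mean squared difference between noise predictions, the top-K supports are assumed to be the K nearest previously learned tokens, and the hinge form with a `margin` parameter is likewise an assumption. All function and parameter names are illustrative.

```python
import numpy as np


def toy_denoiser(latents, timesteps, token, weight):
    """Hypothetical stand-in for a diffusion denoiser's noise prediction.

    latents: (B, D), timesteps: (B,), token: (E,), weight: (E, D).
    A real model would be a conditional U-Net or transformer.
    """
    cond = token @ weight                      # project token to latent size, (D,)
    scale = (timesteps / 1000.0)[:, None]      # per-sample timestep weight, (B, 1)
    return latents * (1.0 - scale) + scale * cond


def diffusion_distance(token_a, token_b, latents, timesteps, weight):
    """Assumed distance: MSE between noise predictions for two tokens,
    evaluated on the same latents and the same timesteps."""
    eps_a = toy_denoiser(latents, timesteps, token_a, weight)
    eps_b = toy_denoiser(latents, timesteps, token_b, weight)
    return float(np.mean((eps_a - eps_b) ** 2))


def top_k_supports(current, previous, k, latents, timesteps, weight):
    """Return indices and distances of the k previously learned tokens
    closest to the current token (nearest-first selection is an assumption)."""
    dists = np.array([
        diffusion_distance(current, p, latents, timesteps, weight)
        for p in previous
    ])
    order = np.argsort(dists)[:k]
    return order, dists[order]


def repulsion_loss(dists, margin):
    """Assumed hinge penalty: positive while a support is closer than margin."""
    return float(np.mean(np.maximum(0.0, margin - dists)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(4, 8))
    timesteps = rng.integers(1, 1000, size=4)
    weight = rng.normal(size=(6, 8))
    current = rng.normal(size=6)
    previous = [rng.normal(size=6) for _ in range(5)]
    idx, d = top_k_supports(current, previous, 3, latents, timesteps, weight)
    print(idx, d, repulsion_loss(d, margin=1.0))
```

In this sketch, sharing `latents` and `timesteps` across both noise predictions means the distance reflects only the difference in conditioning tokens, which is the role that matched latents and timesteps play in the description above.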