Analysis of Text Accuracy and Visual Alignment in Vision-Language Models for Artistic Text Generation
Abstract
Artistic text generation involves rendering textual content in visually creative and contextually appropriate designs, such as floral or geometric patterns. Despite advancements in Vision-Language Models (VLMs) like DALL*E and Qwen, maintaining both text accuracy and aesthetic alignment is a persistent challenge in this domain. This paper presents an analysis of the performance of several VLMs in generating artistic text, aiming to diagnose performance gaps and guide improvements in text accuracy and multimodal alignment. The objective of this study is to propose an analysis framework to evaluate and improve artistic text generation by assessing models across three key dimensions: character-level text accuracy, semantic similarity, and visual-context alignment. A custom dataset of 1,000 prompts, each specifying a word and artistic style to support systematic evaluation was created. We benchmarked the performance of DALL*E, Qwen-VL, and Qwen-2.5B on text accuracy by comparing the generated text to the original prompt and measuring perceptual alignment using CLIP embeddings. The results show that the DALL-E model outperforms the other models, in balancing text accuracy and style alignment. This study highlights the potential of adopting VLMs for artistic text generation. Furthermore, we present exploratory results using ControlNet, focusing on its potential to improve text accuracy in generated images. Our findings expose critical limitations in current VLMs and offer actionable insights for advancing multimodal generation architectures, datasets, and evaluation strategies, as well as provide a framework for improving text accuracy and aesthetic coherence in creative applications, paving the way for advancements in multimodal AI systems.