ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations

Bowen Jiang; Yuan Yuan; Xinyi Bai; Zhuoqun Hao; Alyson Yin; Yaojie Hu; Wenyu Liao; Lyle Ungar; Camillo Jose Taylor

2025 EMNLP EMNLP 2025

ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations

Abstract

AbstractThis work demonstrates that diffusion models can achieve font-controllable multilingual text rendering using just raw images without font label annotations. Visual text rendering remains a significant challenge. While recent methods condition diffusion on glyphs, it is impossible to retrieve exact font annotations from large-scale, real-world datasets, which prevents user-specified font control. To address this, we propose a data-driven solution that integrates the conditional diffusion model with a text segmentation model, utilizing segmentation masks to capture and represent fonts in pixel space in a self-supervised manner, thereby eliminating the need for any ground-truth labels and enabling users to customize text rendering with any multilingual font of their choice. The experiment provides a proof of concept of our algorithm in zero-shot text and font editing across diverse fonts and languages, providing valuable insights for the community and industry toward achieving generalized visual text rendering.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — font control

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Bowen Jiang , Yuan Yuan , Xinyi Bai , Zhuoqun Hao , Alyson Yin , Yaojie Hu , Wenyu Liao , Lyle Ungar , Camillo Jose Taylor

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Self-Supervised Learning Deep Learning > Models > Diffusion Models Computer Vision > Generation > Image Generation

Keywords

zero-shot learning self-supervised learning image editing multilingual text diffusion model text rendering font control

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025