IterMeme: Expert-Guided Multimodal LLM for Interactive Meme Creation with Layout-Aware Generation

Yaqi Cai; Shancheng Fang; Yadong Qu; Xiaorui Wang; Meng Shao; Hongtao Xie

2025 IJCAI IJCAI 2025

IterMeme: Expert-Guided Multimodal LLM for Interactive Meme Creation with Layout-Aware Generation

Abstract

Meme creation is a creative process that blends images and text. However, existing methods lack critical components, failing to support intent-driven caption-layout generation and personalized generation, making it difficult to generate high-quality memes. To address this limitation, we propose IterMeme, an end-to-end interactive meme creation framework that utilizes a unified Multimodal Large Language Model (MLLM) to facilitate seamless collaboration among multiple components. To overcome the absence of a caption-layout generation component, we develop a robust layout representation method and construct a large-scale image-caption-layout dataset, MemeCap, which enhances the model’s ability to comprehend emotions and coordinate caption-layout generation effectively. To address the lack of a personalization component, we introduce a parameter-shared dual-LLM architecture that decouples the intricate representations of reference images and text. Furthermore, we incorporate the expert-guided M³OE for fine-grained identity properties (IP) feature extraction and cross-modal fusion. By dynamically injecting features into every layer of the model, we enable adaptive refinement of both visual and semantic information. Experimental results demonstrate that IterMeme significantly advances the field of meme creation by delivering consistently high-quality outcomes. The code, model, and dataset will be open-sourced to the community.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning

🧭 Keyword Pioneer — caption-layout coordination

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yaqi Cai , Shancheng Fang , Yadong Qu , Xiaorui Wang , Meng Shao , Hongtao Xie

Topics

Deep Learning > Architectures > Transformers Computer Vision > Generation > Image Captioning Computer Vision > Generation > Image Generation

Keywords

multimodal large language model cross-modal feature fusion layout-aware generation caption-layout coordination

Download PDF

Related papers

Learning Advanced Self-Attention for Linear Transformers in the Singular Value Domain 2025

Responsibility Anticipation and Attribution in LTLf 2025

Argument-based Multi-Issue Negotiation 2025

Online Resource Sharing: Better Robust Guarantees via Randomized Strategies 2025

Equitable Mechanism Design for Facility Location 2025