Image Captioning
781 directly classified papers
Papers per year
2003: 1
2008: 1
2011: 1
2012: 1
2013: 5
2014: 2
2015: 21
2016: 17
2017: 36
2018: 47
2019: 92
2020: 73
2021: 96
2022: 91
2023: 107
2024: 86
2025: 96
2026: 8
Papers
Diffusion Bridge: Leveraging Diffusion Model to Reduce the Modality Gap Between Text and Vision for Zero-Shot Image Captioning
CVPR 2025
Scene Graph and Dependency Grammar Enhanced Remote Sensing Change Caption Network (SGD-RSCCN)
COLING 2025
Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models
NAACL 2025
The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning
NAACL 2025
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
CVPR 2024
CIC: A Framework for Culturally-Aware Image Captioning
IJCAI 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
CVPR 2024
Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
AAAI 2024
Chitranuvad: Adapting Multi-lingual LLMs for Multimodal Translation
EMNLP 2024
Masking Latent Gender Knowledge for Debiasing Image Captioning
NAACL 2024
MAFA: Managing False Negatives for Vision-Language Pre-training
CVPR 2024
Altogether: Image Captioning via Re-aligning Alt-text
EMNLP 2024
KALE: An Artwork Image Captioning System Augmented with Heterogeneous Graph
IJCAI 2024
DIUSum: Dynamic Image Utilization for Multimodal Summarization
AAAI 2024
Retrieval Evaluation for Long-Form and Knowledge-Intensive Image–Text Article Composition
EMNLP 2024
DCU ADAPT at WMT24: English to Low-resource Multi-Modal Translation Task
EMNLP 2024
Segment and Caption Anything
CVPR 2024
ImageCaptioner2: Image Captioner for Image Captioning Bias Amplification Assessment
AAAI 2024
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
NAACL 2024
IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning
EMNLP 2024
Video Discourse Parsing and Its Application to Multimodal Summarization: A Dataset and Baseline Approaches
EMNLP 2024
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition
EMNLP 2024
On Scaling Up a Multilingual Vision and Language Model
CVPR 2024
AAST-NLP at Multimodal Hate Speech Event Detection 2024: A Multimodal Approach for Classification of Text-Embedded Images Based on CLIP and BERT-Based Models
EACL 2024
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
CVPR 2024