VisualEDU: A Benchmark for Assessing Coding and Visual Comprehension through Educational Problem-Solving Video Generation

Hao Chen; Tianyu Shi; Pengran Huang; Zeyuan Li; Jiahui Pan; Qianglong Chen; Lewei He

EMNLP 2025

Abstract

Generating logically coherent video from text (T2V) for reasoning-intensive tasks such as mathematical problem-solving remains a significant challenge for Vision-Language Models (VLMs). We therefore introduce VisualEDU, a benchmark built on the Manim package that rigorously evaluates how well VLMs produce coherent, step-by-step video solutions for educational purposes. Its framework integrates meta-prompt learning, visual and code feedback, and a modular drawing toolkit to enhance output quality. We propose novel metrics for temporal consistency, logical correctness, and visual clarity. Extensive experiments across nine VLMs reveal that while advanced proprietary models show promise, all struggle significantly as task complexity increases (e.g., Claude-3.7-Sonnet and GPT-4o both score below 56% on difficult tasks), highlighting limitations in code generation, visual feedback correction, and precise tool invocation. VisualEDU offers a robust platform for systematic T2V assessment in reasoning-intensive domains and guides future VLM improvements in this area.
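The abstract does not spell out how the code feedback works, so the following is only a minimal sketch of one plausible reading: generated code is executed, and any error message is passed back to the model for another attempt. Every name here (`run_candidate`, `refine_with_code_feedback`, `toy_generator`) is hypothetical and not taken from the paper, and the toy generator stands in for a real VLM call that would return Manim scene code.

```python
import traceback
from typing import Callable, Optional, Tuple

# Hypothetical signature: (task, feedback_from_last_attempt) -> Python source.
Generator = Callable[[str, Optional[str]], str]


def run_candidate(source: str) -> Optional[str]:
    """Execute generated source; return None on success or a one-line error summary."""
    try:
        exec(compile(source, "<candidate>", "exec"), {})
        return None
    except Exception:
        # Keep only the final traceback line, which names the exception.
        return traceback.format_exc().strip().splitlines()[-1]


def refine_with_code_feedback(
    task: str, generate: Generator, max_rounds: int = 3
) -> Tuple[Optional[str], int]:
    """Return (working_source, rounds_used), or (None, max_rounds) if all attempts fail."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        source = generate(task, feedback)
        feedback = run_candidate(source)
        if feedback is None:
            return source, round_no
    return None, max_rounds


def toy_generator(task: str, feedback: Optional[str]) -> str:
    # Assumption: stand-in for a model call. The first draft has a typo,
    # and the retry (triggered by any feedback) fixes it.
    if feedback is None:
        return "steps = ['expand', 'simplify']\nprint(stepz)"
    return "steps = ['expand', 'simplify']\nassert len(steps) == 2"
```

Calling `refine_with_code_feedback("solve x^2 - 1 = 0", toy_generator)` succeeds on the second round in this toy setup. A real pipeline would additionally need to render the scene and inspect frames to supply the visual feedback the abstract mentions, which this sketch does not attempt.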

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Core AI > Procedural Generation
Machine Learning > Application Areas > Domain Adaptation
Computer Vision > Generation > Video Generation
Interdisciplinary > Social > Education
Deep Learning > Models > Foundation Models

Keywords

video generation, code generation, visual reasoning, vision-language model, temporal consistency, educational content, educational video
