CLER: Improving Multimodal Financial Reasoning by Cross-MLLM Error Reflection

Shuangyan Deng; Zhongsheng Wang; Rui Mao; Ciprian Doru Giurcăneanu; Jiamou Liu

2026 AAAI AAAI 2026

CLER: Improving Multimodal Financial Reasoning by Cross-MLLM Error Reflection

Abstract

Abstract Recent advances in Multimodal Large Language Models (MLLMs) have enabled joint reasoning over financial textual and visual inputs. However, they still struggle with financial terminology, logical consistency, and numerical computations. Moreover, while commercial large models perform well on reasoning tasks, their high inference costs limit their scalable usage in real world financial applications. We thus propose a cost-effective framework, CLER, that combines contrastive retrieval with step-wise reflection to improve reasoning performance. Also, the reasoning cost is only generated in the test stage when using commercial large models. CLER leverages FinErrorSet, a dataset of 8,000+ mistake correction pairs from diverse open-source MLLMs. A fine grained retriever is trained to identify structurally relevant errors for self-correction through individual reflection. Experiments on three benchmarks show that CLER consistently outperforms other baselines. To our knowledge, CLER is the first framework to use cross-model errors for financial reasoning.

🧭 Keyword Pioneer — error reflection

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Shuangyan Deng , Zhongsheng Wang , Rui Mao , Ciprian Doru Giurcăneanu , Jiamou Liu

Topics

Artificial Intelligence > Core AI > Multimodal Learning

Keywords

multimodal large language model financial reasoning contrastive retrieval error reflection

Download PDF

Related papers

Hi-EF: Benchmarking Emotion Forecasting in Human-interaction 2026

MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding 2026

Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views 2026

LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning 2026

HDGS: Hierarchical Dynamic Gaussian Splatting for Urban Driving Scenes 2026