LLM-Evolve: Evaluation for LLM’s Evolving Capability on Benchmarks

Jiaxuan You; Mingjie Liu; Shrimai Prabhumoye; Mostofa Patwary; Mohammad Shoeybi; Bryan Catanzaro

EMNLP 2024


Abstract

The advancement of large language models (LLMs) has extended their use to dynamic and interactive real-world applications, where models engage continuously with their environment and can potentially improve their performance over time. Most existing LLM benchmarks evaluate LLMs on i.i.d. tasks, overlooking their ability to learn iteratively from past experiences. Our paper bridges this evaluation gap by proposing a novel framework, LLM-Evolve, which extends established benchmarks to sequential problem-solving settings. LLM-Evolve evaluates LLMs over multiple rounds, providing feedback after each round to build a demonstration memory that the models can query in future tasks. We applied LLM-Evolve to the MMLU, GSM8K, and AgentBench benchmarks, testing eight state-of-the-art open-source and closed-source models. Results show that LLMs can achieve performance improvements of up to 17% by learning from past interactions, and that the quality of the retrieval algorithm and of the feedback significantly influences this capability. These insights call for a deeper understanding of, and more benchmarks for, LLMs’ performance in evolving interactive scenarios.
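The multi-round protocol described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`evolve`, `retrieve`, `jaccard`), the word-overlap retriever, the rule that only correctly answered tasks are stored, and the stub model are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical type: a demonstration is a (question, answer) pair.
Demo = Tuple[str, str]
Model = Callable[[str, List[Demo]], str]


def jaccard(a: str, b: str) -> float:
    """Toy similarity based on word overlap (a stand-in for a real retriever)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve(memory: List[Demo], question: str, k: int) -> List[Demo]:
    """Return up to k stored demonstrations most similar to the question."""
    ranked = sorted(memory, key=lambda d: jaccard(d[0], question), reverse=True)
    return ranked[:k]


def evolve(model: Model, tasks: List[Demo], rounds: int, k: int = 2) -> List[float]:
    """Run the same task set for several rounds and return per-round accuracy.

    Assumption: feedback is a correct/incorrect signal, and only correctly
    answered tasks are added to the demonstration memory after each round.
    """
    memory: List[Demo] = []
    accuracies: List[float] = []
    for _ in range(rounds):
        solved: List[Demo] = []
        for question, gold in tasks:
            demos = retrieve(memory, question, k)  # query past experience
            answer = model(question, demos)
            if answer == gold:                     # feedback signal
                solved.append((question, answer))
        # Memory is updated only once the round is finished.
        for demo in solved:
            if demo not in memory:
                memory.append(demo)
        accuracies.append(len(solved) / len(tasks))
    return accuracies


def stub_model(question: str, demos: List[Demo]) -> str:
    """Contrived stand-in for an LLM: it knows one answer unaided and
    answers the other task only when at least one demonstration is supplied."""
    if question == "2 + 2":
        return "4"
    return "6" if demos else "?"


tasks = [("2 + 2", "4"), ("3 + 3", "6")]
accuracies = evolve(stub_model, tasks, rounds=2)
```

With this stub, the first round runs with an empty memory and the second round can draw on the demonstration saved in round one, so `accuracies` rises from one round to the next. The numbers come from the contrived stub only and say nothing about the results reported in the paper.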



Topics

Artificial Intelligence > Core AI > Agent Systems
Machine Learning > Learning Types > Continual Learning
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Learning Types > Multi-Task Learning
Deep Learning > Learning Types > In-Context Learning

Keywords

benchmark evaluation, in-context learning, iterative learning, performance improvement, large language model, interactive scenario, sequential problem-solving, demonstration memory, retrieval algorithm
