Contemporary LLMs struggle with extracting formal legal arguments
Abstract
AbstractLegal Argument Mining (LAM) is a complex challenge for humans and language models alike. This paper explores the application of Large Language Models (LLMs) in LAM, focusing on the identification of fine-grained argument types within judgment texts. We compare the performance of Flan-T5 and Llama 3 models against a baseline RoBERTa model to study if the advantages of magnitude-bigger LLMs can be leveraged for this task. Our study investigates the effectiveness of fine-tuning and prompting strategies in enhancing the models’ ability to discern nuanced argument types. Despite employing state-of-the-art techniques, our findings indicate that neither fine-tuning nor prompting could surpass the performance of a domain-pre-trained encoder-only model. This highlights the challenges and limitations in adapting general-purpose large language models to the specialized domain of legal argumentation. The insights gained from this research contribute to the ongoing discourse on optimizing NLP models for complex, domain-specific tasks. Our code and data for reproducibility are available at https://github.com/trusthlt/legal-argument-spans.