GerDaLIR: A German Dataset for Legal Information Retrieval

Marco Wrzalik; Dirk Krechel

EMNLP 2021

Abstract

We present GerDaLIR, a German Dataset for Legal Information Retrieval based on case documents from the open legal information platform Open Legal Data. The dataset consists of 123K queries, each labelled with at least one relevant document in a collection of 131K case documents. We conduct several baseline experiments including BM25 and a state-of-the-art neural re-ranker. With our dataset, we aim to provide a standardized benchmark for German LIR and promote open research in this area. Beyond that, our dataset comprises sufficient training data to be used as a downstream task for German or multilingual language models.

Topics

Natural Language Processing > Applications > Information Retrieval
Data Science & Analytics > Applications > Information Retrieval
Machine Learning > Application Areas > Information Retrieval

Keywords

German language, document retrieval, benchmark dataset, multilingual language model, legal information retrieval, neural re-ranker, neural ranker, neural reranking
