Leveraging Domain Adaptation and Data Augmentation to Improve Qur’anic IR in English and Arabic

Vera Pavlova

2023 EMNLP EMNLP 2023

Leveraging Domain Adaptation and Data Augmentation to Improve Qur’anic IR in English and Arabic

Abstract

AbstractIn this work, we approach the problem of Qur’anic information retrieval (IR) in Arabic and English. Using the latest state-of-the-art methods in neural IR, we research what helps to tackle this task more efficiently. Training retrieval models requires a lot of data, which is difficult to obtain for training in-domain. Therefore, we commence with training on a large amount of general domain data and then continue training on in-domain data. To handle the lack of in-domain data, we employed a data augmentation technique, which considerably improved results in MRR@10 and NDCG@5 metrics, setting the state-of-the-art in Qur’anic IR for both English and Arabic. The absence of an Islamic corpus and domain-specific model for IR task in English motivated us to address this lack of resources and take preliminary steps of the Islamic corpus compilation and domain-specific language model (LM) pre-training, which helped to improve the performance of the retrieval models that use the domain-specific LM as the shared backbone. We examined several language models (LMs) in Arabic to select one that efficiently deals with the Qur’anic IR task. Besides transferring successful experiments from English to Arabic, we conducted additional experiments with retrieval task in Arabic to amortize the scarcity of general domain datasets used to train the retrieval models. Handling Qur’anic IR task combining English and Arabic allowed us to enhance the comparison and share valuable insights across models and languages.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — religious text retrieval

🐣 Hot Topic Early Bird — semantic search

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Vera Pavlova

Topics

Machine Learning > Application Areas > Data Augmentation Machine Learning > Application Areas > Domain Adaptation Natural Language Processing > Applications > Information Retrieval Machine Learning > Learning Types > Domain Adaptation Machine Learning > Learning Types > Data Augmentation Deep Learning > Learning Types > Domain Adaptation Deep Learning > Learning Types > Data Augmentation

Keywords

domain adaptation data augmentation information retrieval semantic search language model neural information retrieval neural retrieval cross-lingual retrieval religious text retrieval

Download PDF

Related papers

Exploring Linguistic Probes for Morphological Generalization 2023

NameGuess: Column Name Expansion for Tabular Data 2023

Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning 2023

Improving Conversational Recommendation Systems via Bias Analysis and Language-Model-Enhanced Data Augmentation 2023

On the Calibration of Large Language Models and Alignment 2023