The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG)

Shenglai Zeng; Jiankun Zhang; Pengfei He; Yiding Liu; Yue Xing; Han Xu; Jie Ren; Yi Chang; Shuaiqiang Wang; Dawei Yin; Jiliang Tang

ACL 2024

Abstract

Retrieval-augmented generation (RAG) is a powerful technique to facilitate language model generation with proprietary and private data, where data privacy is a pivotal concern. Whereas extensive research has demonstrated the privacy risks of large language models (LLMs), the RAG technique could potentially reshape the inherent behaviors of LLM generation, posing new privacy issues that are currently under-explored. To this end, we conduct extensive empirical studies with novel attack methods, which demonstrate the vulnerability of RAG systems to leaking the private retrieval database. Despite the new risks that RAG brings to the retrieval data, we further discover that RAG can be used to mitigate the old risks, i.e., the leakage of the LLMs’ training data. In general, we reveal many new insights in this paper for privacy protection of retrieval-augmented LLMs, which could benefit builders of both LLMs and RAG systems.

Topics

Artificial Intelligence > Core AI > Responsible AI; Machine Learning > Application Areas > Privacy; Security & Privacy > Privacy; Natural Language Processing > Generation > Retrieval-Augmented Generation

Keywords

data privacy; language model; retrieval-augmented generation; privacy leakage; training datum; large language model; private retrieval database; private database
