Detecting Personal Identifiable Information in Swedish Learner Essays

Maria Irena Szawerna; Simon Dobnik; Ricardo Muñoz Sánchez; Therese Lindström Tiedemann; Elena Volodina

EACL 2024


Abstract

Linguistic data can, and often does, contain PII (Personal Identifiable Information). From both a legal and an ethical standpoint, sharing such data is not permissible. According to the GDPR, pseudonymization, i.e. the replacement of sensitive information with surrogates, is an acceptable strategy for privacy preservation. While research has been conducted on the detection and replacement of sensitive data in Swedish medical data using Large Language Models (LLMs), it is unclear whether these models handle PII in less structured and more thematically varied texts equally well. In this paper, we present and discuss the performance of an LLM-based PII-detection system for Swedish learner essays.



Topics

Machine Learning > Application Areas > Privacy; Security & Privacy > Privacy; Artificial Intelligence > Core AI > Privacy; Natural Language Processing > Applications > Named Entity Recognition

Keywords

named entity recognition; text anonymization; personal identifiable information; large language model; Swedish learner essay

