TopGuNN: Fast NLP Training Data Augmentation using Large Corpora

Rebecca Iglesias-Flores; Megha Mishra; Ajay Patel; Akanksha Malhotra; Reno Kriz; Martha Palmer; Chris Callison-Burch

NAACL 2021

Abstract

Acquiring training data for natural language processing systems can be expensive and time-consuming. Given a few training examples crafted by experts, large corpora can be mined for thousands of semantically similar examples that provide useful variability to improve model generalization. We present TopGuNN, a fast contextualized k-NN retrieval system that can efficiently index and search over contextual embeddings generated from large corpora. TopGuNN is demonstrated for a training data augmentation use case over the Gigaword corpus. Using approximate k-NN and an efficient architecture, TopGuNN performs queries over an embedding space of 4.63TB (approximately 1.5B embeddings) in less than a day.
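
The retrieval idea in the abstract (find the corpus embeddings closest to the embedding of a query example) can be illustrated with a minimal sketch. This is not the TopGuNN implementation: it uses exact brute-force cosine similarity in NumPy rather than approximate k-NN, and the embeddings are random toy vectors. The function name `top_k_neighbors`, the 768-dimensional vector size, and the corpus size of 1,000 are assumptions made only for illustration.

```python
import numpy as np


def top_k_neighbors(query, index, k):
    """Return (row indices, cosine similarities) of the k rows of `index` closest to `query`.

    Exact brute-force search; a stand-in for the approximate k-NN
    index described in the abstract, not a reproduction of it.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    index_norm = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    sims = index_norm @ query_norm
    # Sort by descending similarity and keep the first k positions.
    top = np.argsort(-sims)[:k]
    return top, sims[top]


# Toy data (assumption): 1,000 random "contextual embeddings" of dimension 768.
rng = np.random.default_rng(0)
corpus_embeddings = rng.normal(size=(1000, 768)).astype(np.float32)

# Hypothetical query: a slightly perturbed copy of corpus row 42.
noise = 0.01 * rng.normal(size=768).astype(np.float32)
query_embedding = corpus_embeddings[42] + noise

ids, scores = top_k_neighbors(query_embedding, corpus_embeddings, k=5)
print(ids, scores)
```

Brute-force search like this compares the query against every stored vector, which is why a corpus on the scale reported in the abstract calls for an approximate index instead.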

Topics

Machine Learning > Application Areas > Data Augmentation
Natural Language Processing > Resources & Methods > Text Representation

Keywords

data augmentation, text retrieval, k-nearest neighbor, model generalization, contextual embedding
