apricot: Submodular selection for data summarization in Python

Jacob Schreiber; Jeffrey Bilmes; William Stafford Noble

JMLR, 2020

Abstract

We present apricot, an open source Python package for selecting representative subsets from large data sets using submodular optimization. The package implements several efficient greedy selection algorithms that offer strong theoretical guarantees on the quality of the selected set. It also implements several submodular set functions, including facility location, which is broadly applicable but requires memory quadratic in the number of examples in the data set, and a feature-based function that is less broadly applicable but can scale to millions of examples. Apricot is extremely efficient, combining algorithmic speedups, such as the lazy greedy algorithm and memoization, with code optimization using numba. We demonstrate the use of subset selection by training machine learning models to comparable accuracy using either the full data set or a representative subset thereof. This paper presents an explanation of submodular selection, an overview of the features in apricot, and applications to two data sets.
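To make the abstract's terms concrete, the following is a minimal, standard-library-only sketch of lazy greedy selection under a facility location objective. It is not apricot's implementation or API: the function names (`facility_location_gain`, `lazy_greedy_facility_location`), the toy one-dimensional data, and the Gaussian similarity are all assumptions chosen for illustration. The sketch stores the full pairwise similarity matrix, which is the quadratic memory cost the abstract mentions.

```python
import heapq
import math


def facility_location_gain(similarity, current_best, candidate):
    """Marginal gain of adding `candidate` to the selected set.

    Facility location scores a set S as sum_i max_{j in S} similarity[i][j].
    `current_best[i]` caches max_{j in S} similarity[i][j] for the current S
    (a simple form of memoization), so the gain is the total improvement.
    """
    row = similarity[candidate]
    return sum(max(s - b, 0.0) for s, b in zip(row, current_best))


def lazy_greedy_facility_location(similarity, k):
    """Select up to k indices with the lazy greedy algorithm.

    Assumes `similarity` is a symmetric, non-negative n-by-n list of lists.
    """
    n = len(similarity)
    current_best = [0.0] * n
    # heapq is a min-heap, so gains are negated to pop the largest first.
    heap = [(-facility_location_gain(similarity, current_best, j), j)
            for j in range(n)]
    heapq.heapify(heap)

    selected = []
    while heap and len(selected) < k:
        _, j = heapq.heappop(heap)
        # Stored gains may be stale. Submodularity means a gain can only
        # shrink as the set grows, so a stale gain is still an upper bound.
        gain = facility_location_gain(similarity, current_best, j)
        if not heap or gain >= -heap[0][0]:
            # The refreshed gain beats every other upper bound: select j.
            selected.append(j)
            current_best = [max(b, s)
                            for b, s in zip(current_best, similarity[j])]
        else:
            # Otherwise re-insert j with its refreshed gain and try again.
            heapq.heappush(heap, (-gain, j))
    return selected


# Toy data (an assumption): two well-separated clusters on a line.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
similarity = [[math.exp(-(a - b) ** 2) for b in points] for a in points]
selected = lazy_greedy_facility_location(similarity, k=2)
print(selected)
```

With two well-separated clusters and `k=2`, the selection contains one index from each cluster, because a second point from an already-covered cluster adds almost nothing to the objective. The lazy step avoids recomputing every candidate's gain in each round, which is where the speedup over naive greedy selection comes from.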
