Linear Regression using Heterogeneous Data Batches

Ayush Jain; Rajat Sen; Weihao Kong; Abhimanyu Das; Alon Orlitsky

NeurIPS 2024

Abstract

In many learning applications, data are collected from multiple sources, each providing a batch of samples that by itself is insufficient to learn its input-output relationship. A common approach assumes that the sources fall into one of several unknown subgroups, each with an unknown input distribution and input-output relationship. We consider one of this setup's most fundamental and important manifestations, where the output is a noisy linear combination of the inputs and there are $k$ subgroups, each with its own regression vector. Prior work [KSS$^+$20] showed that, with abundant small batches, the regression vectors can be learned with only a few, $\tilde\Omega(k^{3/2})$, medium-size batches of $\tilde\Omega(\sqrt k)$ samples each. However, that paper requires the input distribution of all $k$ subgroups to be isotropic Gaussian, and states that removing this assumption is an "interesting and challenging problem". We propose a novel gradient-based algorithm that improves on the existing results in several ways. It extends the applicability of the algorithm by: (1) allowing the subgroups' underlying input distributions to be different, unknown, and heavy-tailed; (2) recovering all subgroups followed by a significant proportion of batches, even for infinite $k$; (3) removing the separation requirement between the regression vectors; (4) reducing the number of batches and allowing smaller batch sizes.
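The setup described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's algorithm: it only generates batches from $k$ subgroups and contrasts fitting a single small batch with an oracle that pools all batches of the same subgroup using the true labels, which a real method would not know. All names (`make_batches`, `least_squares`) and all parameter values are hypothetical, and the inputs are isotropic Gaussian purely for simplicity, which is the restrictive assumption the paper removes.

```python
import numpy as np

def make_batches(num_batches, batch_size, dim, k, noise_std=0.1, seed=0):
    """Simulate batches; each batch comes from one of k hidden subgroups."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(k, dim))                   # one regression vector per subgroup
    labels = rng.integers(0, k, size=num_batches)   # hidden subgroup of each batch
    # Assumption: isotropic Gaussian inputs, chosen only to keep the sketch short.
    X = rng.normal(size=(num_batches, batch_size, dim))
    noise = noise_std * rng.normal(size=(num_batches, batch_size))
    # Output is a noisy linear combination of the inputs.
    y = np.einsum("bnd,bd->bn", X, W[labels]) + noise
    return X, y, labels, W

def least_squares(X, y):
    """Ordinary least squares (minimum-norm solution if underdetermined)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

dim = 20
X, y, labels, W = make_batches(num_batches=400, batch_size=4, dim=dim, k=3)
target = W[labels[0]]

# A single batch has 4 samples in 20 dimensions: too few to learn its vector.
err_single = np.linalg.norm(least_squares(X[0], y[0]) - target)

# Oracle pooling: uses the true labels, which are unknown in the real problem.
mask = labels == labels[0]
w_pooled = least_squares(X[mask].reshape(-1, dim), y[mask].reshape(-1))
err_pooled = np.linalg.norm(w_pooled - target)

print(f"single-batch error: {err_single:.3f}, oracle-pooled error: {err_pooled:.3f}")
```

The gap between the two errors shows why identifying which batches share a subgroup matters: the difficulty the paper addresses is doing so without access to the labels.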

Topics

Machine Learning > Core Methods > Regression
Machine Learning > Learning Types > Unsupervised Learning
Machine Learning > Optimization & Theory > Learning Theory
Machine Learning > Optimization & Theory > Statistical Learning
Mathematics & Optimization > Optimization > Stochastic Methods
Machine Learning > Learning Types > Multi-Source Learning

Keywords

multi-task learning, multi-source learning, batch learning, linear regression, heavy-tailed distribution, heterogeneous data, gradient-based algorithm, subgroup recovery, subgroup learning, heterogeneous batch, unknown subgroup, regression vector
