From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification

Andre Martins; Ramón Astudillo

ICML 2016

Abstract

We propose sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities. After deriving its properties, we show how its Jacobian can be efficiently computed, enabling its use in a network trained with backpropagation. Then, we propose a new smooth and convex loss function which is the sparsemax analogue of the logistic loss. We reveal an unexpected connection between this new loss and the Huber classification loss. We obtain promising empirical results in multi-label classification problems and in attention-based neural networks for natural language inference. For the latter, we achieve performance similar to that of the traditional softmax, but with a selective, more compact attention focus.
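The abstract does not spell out the formula, so the following is only a minimal NumPy sketch under one assumption: that sparsemax is the Euclidean projection of the score vector onto the probability simplex, computed with the usual sort-and-threshold procedure. The function names `sparsemax`, `sparsemax_jvp`, and `softmax` are illustrative choices, not identifiers from the paper. The Jacobian-vector product relies on the same assumption, which gives a Jacobian of the form diag(s) - s sᵀ/|S|, where s is the 0/1 indicator of the nonzero outputs and |S| is their count.

```python
import numpy as np


def sparsemax(z):
    """Project the score vector z onto the probability simplex (assumed definition)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]            # scores in decreasing order
    cumsum = np.cumsum(z_sorted)           # running sums of the sorted scores
    k = np.arange(1, z.size + 1)
    # Largest k such that 1 + k * z_(k) > sum of the top-k scores.
    in_support = 1 + k * z_sorted > cumsum
    k_z = k[in_support][-1]
    tau = (cumsum[k_z - 1] - 1.0) / k_z    # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)


def sparsemax_jvp(z, v):
    """Jacobian-vector product J(z) @ v without forming the full Jacobian."""
    v = np.asarray(v, dtype=float)
    s = sparsemax(z) > 0                   # support: coordinates with nonzero output
    # J = diag(s) - s s^T / |S|, so J v subtracts the support mean of v.
    return np.where(s, v - v[s].mean(), 0.0)


def softmax(z):
    """Standard softmax, included only for comparison."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())                # shift by the max for numerical stability
    return e / e.sum()


scores = np.array([2.0, 1.0, 0.1])
p_sparse = sparsemax(scores)               # some entries can be exactly zero
p_dense = softmax(scores)                  # every entry is strictly positive
```

For the scores `[2.0, 1.0, 0.1]`, the threshold works out to 1, so the sketch puts all of the probability mass on the first entry, while softmax keeps every entry positive. Because the product only needs the support and a mean over it, its cost is linear in the number of scores, which is one way to read the abstract's claim that the Jacobian can be computed efficiently.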

Authors

Andre Martins, Ramón Astudillo

Topics

Machine Learning > Core Methods > Classification
Deep Learning > Architectures > Transformers

Keywords

attention mechanism, natural language inference, multi-label classification, sparse attention, sparsemax activation
