2021 IJCAI IJCAI 2021

Isotonic Data Augmentation for Knowledge Distillation

Abstract

Knowledge distillation uses both real hard labels and soft labels predicted by teacher model as supervision. Intuitively, we expect the soft label probabilities and hard label probabilities to be concordant. However, in the real knowledge distillations, we found critical rank violations between hard labels and soft labels for augmented samples. For example, for an augmented sample x = 0.7 * cat + 0.3 * panda, a meaningful soft label distribution should have the same rank: P(cat|x)>P(panda|x)>P(other|x). But real teacher models usually violate the rank: P(tiger|x)>P(panda|x)>P(cat|x). We attribute the rank violations to the increased difficulty of understanding augmented samples for the teacher model. Empirically, we found the violations injuries the knowledge transfer. In this paper, we denote eliminating rank violations in data augmentation for knowledge distillation as isotonic data augmentation (IDA). We use isotonic regression (IR) -- a classic statistical algorithm -- to eliminate the rank violations. We show that IDA can be modeled as a tree-structured IR problem and gives an O(c*log(c)) optimal algorithm, where c is the number of labels. In order to further reduce the time complexity of the optimal algorithm, we also proposed a GPU-friendly approximation algorithm with linear time complexity. We have verified on variant datasets and data augmentation baselines that (1) the rank violation is a general phenomenon for data augmentation in knowledge distillation. And (2) our proposed IDA algorithms effectively increases the accuracy of knowledge distillation by solving the ranking violations.

🧭 Keyword Pioneer — rank violation
🐝 Cross-Pollinator — Artificial Intelligence, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Speech & Audio

Authors