What’s Hidden in a One-layer Randomly Weighted Transformer?

Sheng Shen; Zhewei Yao; Douwe Kiela; Kurt Keutzer; Michael Mahoney

EMNLP 2021

Abstract

We demonstrate that, hidden within one-layer randomly weighted neural networks, there exist subnetworks that can achieve impressive performance on machine translation tasks without ever modifying the weight initializations. To find subnetworks in a one-layer randomly weighted neural network, we apply different binary masks to the same weight matrix to generate different layers. Hidden within a one-layer randomly weighted Transformer, we find subnetworks that can achieve 29.45/17.29 BLEU on IWSLT14/WMT14. Using a fixed pre-trained embedding layer, the previously found subnetworks are smaller than a trained Transformer small/base, yet can match 98%/92% of its performance (34.14/25.24 BLEU) on IWSLT14/WMT14. Furthermore, we demonstrate the effectiveness of larger and deeper transformers in this setting, as well as the impact of different initialization methods.
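The masking idea in the abstract (different binary masks applied to the same weight matrix to generate different layers) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the top-k selection rule, the random `layer_scores`, the `keep_ratio` value, the ReLU stack, and all function names are assumptions made for illustration only. The abstract does not say how the masks are found, so the scores here are simply random stand-ins.

```python
import numpy as np


def topk_mask(scores, keep_ratio):
    """Return a 0/1 mask that keeps the top `keep_ratio` fraction of entries by score."""
    k = int(round(keep_ratio * scores.size))
    mask = np.zeros(scores.size, dtype=np.float64)
    if k > 0:
        # Indices of the k largest scores (order among them does not matter).
        idx = np.argpartition(scores.ravel(), -k)[-k:]
        mask[idx] = 1.0
    return mask.reshape(scores.shape)


rng = np.random.default_rng(0)
d_model, n_layers, keep_ratio = 8, 3, 0.5  # illustrative values, not from the paper

# One randomly initialized weight matrix, shared by every layer and never updated.
shared_weight = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
frozen_copy = shared_weight.copy()

# Hypothetical per-layer scores: random here, standing in for whatever
# procedure actually selects the subnetwork (an assumption of this sketch).
layer_scores = [rng.standard_normal((d_model, d_model)) for _ in range(n_layers)]
layer_masks = [topk_mask(s, keep_ratio) for s in layer_scores]


def forward(x):
    """Apply each 'layer' as the shared weights gated by that layer's mask, then ReLU."""
    for mask in layer_masks:
        x = np.maximum(x @ (shared_weight * mask), 0.0)
    return x


x = rng.standard_normal((2, d_model))
y = forward(x)
```

Only the masks differ between layers in this sketch; the shared weight values stay exactly as initialized, which is the property the abstract emphasizes.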

Topics

Machine Learning > Learning Types > Unsupervised Learning
Deep Learning > Architectures > Transformers
Natural Language Processing > Applications > Machine Translation
Machine Learning > Learning Types > Transfer Learning
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Models > Transformers

Keywords

transformer architecture, machine translation, neural architecture search, weight initialization, subnetwork extraction, binary mask, randomly weighted transformer, randomly weighted neural network

Related papers

Continual Learning in Multilingual NMT via Language-Specific Embeddings 2021

MultiDoc2Dial: Modeling Dialogues Grounded in Multiple Documents 2021

Efficient Multi-Task Auxiliary Learning: Selecting Auxiliary Data by Feature Similarity 2021

Neural Machine Translation with Heterogeneous Topic Knowledge Embeddings 2021

Semantics-Preserved Data Augmentation for Aspect-Based Sentiment Analysis 2021