Refining BERT Embeddings for Document Hashing via Mutual Information Maximization

Zijing Ou; Qinliang Su; Jianxing Yu; Ruihui Zhao; Yefeng Zheng; Bang Liu

EMNLP 2021

Abstract

Existing unsupervised document hashing methods are mostly built on generative models. Because long-range dependency structures are difficult to capture, these methods rarely model the raw documents directly and instead model features extracted from them (e.g., bag-of-words (BOW), TFIDF). In this paper, we propose to learn hash codes from BERT embeddings, motivated by their tremendous success on downstream tasks. As a first attempt, we modify existing generative hashing models to accommodate the BERT embeddings. However, little improvement is observed over the codes learned from the old BOW or TFIDF features. We attribute this to the reconstruction requirement in generative hashing, which forces the irrelevant information that is abundant in the BERT embeddings to be compressed into the codes as well. To remedy this issue, we further propose a new unsupervised hashing paradigm based on the mutual information (MI) maximization principle. Specifically, the method first constructs appropriate global and local codes from the documents and then seeks to maximize their mutual information. Experimental results on three benchmark datasets demonstrate that the proposed method generates hash codes that outperform existing ones learned from BOW features by a substantial margin.
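To make the global/local mutual information objective more concrete, here is a minimal NumPy sketch. The abstract does not say which MI estimator, critic, or code construction the paper uses, so everything below is an assumption for illustration: the sketch uses a Jensen-Shannon style estimator with a plain dot-product critic, treats local codes from the same document as positive pairs and local codes from other documents in the batch as negatives, and the names `softplus` and `jsd_mi_estimate` are hypothetical.

```python
import numpy as np


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)


def jsd_mi_estimate(global_codes, local_codes):
    """Jensen-Shannon style MI estimate between global and local codes.

    global_codes: (B, D) array, one global code per document.
    local_codes:  (B, T, D) array, T local codes per document.
    Assumes B >= 2 so that every document has in-batch negatives.
    """
    batch_size = local_codes.shape[0]
    # scores[i, j, t] = dot(global code of doc i, t-th local code of doc j)
    scores = np.einsum("id,jtd->ijt", global_codes, local_codes)
    same_doc = np.eye(batch_size, dtype=bool)
    pos = scores[same_doc]    # (B, T): global and local code from the same document
    neg = scores[~same_doc]   # (B * (B - 1), T): codes from different documents
    # Reward high scores on positive pairs and penalize high scores on negatives.
    return np.mean(-softplus(-pos)) - np.mean(softplus(neg))
```

In a training loop, one would maximize this quantity with respect to the encoder that produces the codes. A real hashing model would also need a way to obtain binary codes while keeping the objective trainable, which this sketch leaves out.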

Topics

Machine Learning > Core Methods > Representation Learning

Machine Learning > Core Methods > Embedding Learning

Machine Learning > Learning Types > Unsupervised Learning

Deep Learning > Optimization & Theory > Optimization

Deep Learning > Learning Types > Representation Learning

Machine Learning > Core Methods > Retrieval

Keywords

unsupervised learning, representation learning, mutual information, BERT embedding, document hashing
