Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words

Kaitlyn Zhou; Kawin Ethayarajh; Dallas Card; Dan Jurafsky

ACL 2022

Abstract

Cosine similarity of contextual embeddings is used in many NLP tasks (e.g., QA, IR, MT) and metrics (e.g., BERTScore). Here, we uncover systematic ways in which word similarities estimated by cosine over BERT embeddings are understated and trace this effect to training data frequency. We find that relative to human judgements, cosine similarity underestimates the similarity of frequent words with other instances of the same word or other words across contexts, even after controlling for polysemy and other factors. We conjecture that this underestimation of similarity for high frequency words is due to differences in the representational geometry of high and low frequency words and provide a formal argument for the two-dimensional case.
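For reference, the quantity the abstract discusses is the cosine of the angle between two embedding vectors. The following is a minimal illustrative sketch, not the authors' code: the function name `cosine_similarity` and the toy two-dimensional vectors are assumptions made for illustration, and the vectors are not actual BERT outputs.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 2-D vectors standing in for two contextual embeddings
# of the same word in different contexts (not real model outputs).
emb_a = [1.0, 0.0]
emb_b = [1.0, 1.0]
sim = cosine_similarity(emb_a, emb_b)  # cosine of a 45-degree angle
```

Because the measure depends only on the angle between the vectors and not on their lengths, rescaling either vector leaves the score unchanged.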

Topics

Machine Learning > Core Methods > Representation Learning

Natural Language Processing > Resources & Methods > Text Representation
