Detecting Subtle Sense Shift with Polysemy-Aware Trends
Abstract
AbstractLanguage changes faster than dictionaries can be revised, yet automatic tools still struggle to spot the subtle, short-term shifts in meaning that precede a formal update. We present a language-independent pipeline that detects word-sense shifts in large, time-stamped web corpora. The method couples a robust re-implementation of the Adaptive Skip-Gram model, which induces multiple sense vectors per lemma without any external inventory, with a second stage that tracks each sense through time under three alternative frequency normalizations. Linear Regression and the robust Mann-Kendall/Theil-Sen estimator then test whether a sense’s frequency slope deviates significantly from zero, producing a ranked list of headwords whose semantics are drifting.We evaluate the system on the English (12 B tokens) and Czech (1 B tokens) Timestamped corpora for May 2023-May 2025. Expert annotation of the top-100 candidates for each model variant shows that 50.7% of Czech and 25.7% of English headwords exhibit genuine sense shifts, despite web-scale noise.