Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging

Nils Reimers; Iryna Gurevych

2017 EMNLP EMNLP 2017

Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging

Abstract

AbstractIn this paper we show that reporting a single performance score is insufficient to compare non-deterministic approaches. We demonstrate for common sequence tagging tasks that the seed value for the random number generator can result in statistically significant (p < 10-4) differences for state-of-the-art systems. For two recent systems for NER, we observe an absolute difference of one percentage point F₁-score depending on the selected seed value, making these systems perceived either as state-of-the-art or mediocre. Instead of publishing and reporting single performance scores, we propose to compare score distributions based on multiple executions. Based on the evaluation of 50.000 LSTM-networks for five sequence tagging tasks, we present network architectures that produce both superior performance as well as are more stable with respect to the remaining hyperparameters.

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing

🐣 Hot Topic Early Bird — sequence tagging

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Nils Reimers , Iryna Gurevych

Topics

Machine Learning > Core Methods > Classification Machine Learning > Optimization & Theory > Statistical Learning Natural Language Processing > Understanding > Named Entity Recognition

Keywords

named entity recognition sequence tagging performance evaluation statistical significance neural network

Download PDF

Related papers

Reinforced Video Captioning with Entailment Rewards 2017

Cross-lingual Character-Level Neural Morphological Tagging 2017

Inter-Weighted Alignment Network for Sentence Pair Modeling 2017

Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings 2017

An Empirical Analysis of Edit Importance between Document Versions 2017