Leveraging the Wikipedia Graph for Evaluating Word Embeddings

Joachim Giesen, Paul Kahlmeyer, Frank Nussbaum, Sina Zarrieß

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 4136-4142. https://doi.org/10.24963/ijcai.2022/574

Deep learning models for many NLP tasks rely on pre-trained word embeddings, that is, vector representations of words. It is therefore crucial to evaluate pre-trained word embeddings independently of downstream tasks. Such evaluations try to assess whether the geometry induced by a word embedding captures connections made in natural language, such as analogies, clusterings of words, or word similarities. Traditionally, similarity is measured by comparison to human judgment. However, explicitly annotating word pairs with similarity scores by surveying humans is expensive. We tackle this problem by formulating a similarity measure that is based on an agent for routing through the Wikipedia hyperlink graph. In this graph, word similarities are implicitly encoded by edges between articles. We show on the English Wikipedia that our measure correlates well with a large group of traditional similarity measures, while covering a much larger proportion of words and avoiding explicit human labeling. Moreover, since Wikipedia is available in more than 300 languages, our measure can easily be adapted to other languages, in contrast to traditional similarity measures.
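To illustrate the core idea that hyperlink structure implicitly encodes word similarity, the sketch below computes a graph-based similarity from hop distances on a toy hyperlink graph. This is only a minimal illustration under assumed data: the graph, the `hop_distance`/`similarity` helpers, and the distance-to-similarity mapping are hypothetical and stand in for the paper's learned routing agent, which is not reproduced here.

```python
from collections import deque

# Toy hyperlink graph: article -> linked articles (hypothetical data,
# not from the paper; the actual method routes on the real Wikipedia graph).
links = {
    "dog":    {"wolf", "pet"},
    "wolf":   {"dog", "forest"},
    "pet":    {"dog", "cat"},
    "cat":    {"pet"},
    "forest": {"wolf", "tree"},
    "tree":   {"forest"},
}

def hop_distance(graph, src, dst):
    """Breadth-first search for the number of hyperlink hops from src to dst."""
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # dst is unreachable from src

def similarity(graph, a, b):
    """Map hop distance to a score in (0, 1]; closer articles score higher."""
    d = hop_distance(graph, a, b)
    return None if d is None else 1.0 / (1.0 + d)
```

For example, `similarity(links, "dog", "wolf")` is 0.5 (one hop), while `similarity(links, "dog", "tree")` is 0.25 (three hops), so directly linked articles are scored as more similar, without any human-annotated word pairs.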
Keywords:
Natural Language Processing: Embeddings
Agent-based and Multi-agent Systems: Applications
Natural Language Processing: Resources and Evaluation
Search: Local search