Leveraging the Wikipedia Graph for Evaluating Word Embeddings

Joachim Giesen, Paul Kahlmeyer, Frank Nussbaum, Sina Zarrieß

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 4136-4142. https://doi.org/10.24963/ijcai.2022/574

Deep learning models for many NLP tasks rely on pre-trained word embeddings, that is, vector representations of words. It is therefore crucial to evaluate pre-trained word embeddings independently of downstream tasks. Such evaluations try to assess whether the geometry induced by a word embedding captures connections made in natural language, such as analogies, clusterings of words, or word similarities. Traditionally, similarity is measured by comparison to human judgment. However, explicitly annotating word pairs with similarity scores by surveying humans is expensive. We tackle this problem by formulating a similarity measure that is based on an agent for routing through the Wikipedia hyperlink graph. In this graph, word similarities are implicitly encoded by edges between articles. We show on the English Wikipedia that our measure correlates well with a large group of traditional similarity measures, while covering a much larger proportion of words and avoiding explicit human labeling. Moreover, since Wikipedia is available in more than 300 languages, our measure can easily be adapted to other languages, in contrast to traditional similarity measures.
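To illustrate the core idea that hyperlink structure implicitly encodes word similarity, the sketch below computes a graph-based similarity from hop distances on a toy hyperlink graph. This is only a minimal illustration under assumed data: the graph, the `hop_distance`/`similarity` helpers, and the distance-to-similarity mapping are hypothetical and stand in for the paper's learned routing agent, which is not reproduced here.

```python
from collections import deque

# Toy hyperlink graph: article -> linked articles (hypothetical data,
# not from the paper; the actual method routes on the real Wikipedia graph).
links = {
    "dog":    {"wolf", "pet"},
    "wolf":   {"dog", "forest"},
    "pet":    {"dog", "cat"},
    "cat":    {"pet"},
    "forest": {"wolf", "tree"},
    "tree":   {"forest"},
}

def hop_distance(graph, src, dst):
    """Breadth-first search for the number of hyperlink hops from src to dst."""
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # dst is unreachable from src

def similarity(graph, a, b):
    """Map hop distance to a score in (0, 1]; closer articles score higher."""
    d = hop_distance(graph, a, b)
    return None if d is None else 1.0 / (1.0 + d)
```

For example, `similarity(links, "dog", "wolf")` is 0.5 (one hop), while `similarity(links, "dog", "tree")` is 0.25 (three hops), so directly linked articles are scored as more similar, without any human-annotated word pairs.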
Keywords:
Natural Language Processing: Embeddings
Agent-based and Multi-agent Systems: Applications
Natural Language Processing: Resources and Evaluation
Search: Local search