Lexicons on Demand: Neural Word Embeddings for Large-Scale Text Analysis

Ethan Fast, Binbin Chen, Michael S. Bernstein

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence
Best Sister Conferences. Pages 4836-4840. https://doi.org/10.24963/ijcai.2017/677

Human language is colored by a broad range of topics, but existing text analysis tools focus on only a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like "bleed" and "punch" to generate the category violence). Empath draws connotations between words and phrases by learning a neural embedding across billions of words on the web. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated, such as neglect, government, and social media. We show that Empath's data-driven, human-validated categories are highly correlated (r=0.906) with similar categories in LIWC.
Keywords:
Artificial Intelligence: natural language processing
Artificial Intelligence: knowledge representation and reasoning
Artificial Intelligence: human computer interaction
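
The seed-expansion step described in the abstract can be illustrated with a minimal sketch. This is not Empath's implementation: the three-dimensional vectors are made-up stand-ins for a learned embedding, the function names `expand_category` and `human_filter` are hypothetical, ranking by cosine similarity to the seed centroid is an assumed choice, and the crowd-powered filter is simulated with a fixed set of approved words.

```python
import math

# Made-up 3-dimensional vectors standing in for a learned neural embedding.
TOY_EMBEDDING = {
    "bleed":  [0.90, 0.10, 0.00],
    "punch":  [0.80, 0.20, 0.10],
    "stab":   [0.85, 0.15, 0.05],
    "wound":  [0.90, 0.05, 0.10],
    "senate": [0.00, 0.90, 0.20],
    "tweet":  [0.10, 0.20, 0.90],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_category(seeds, embedding, top_k=2):
    """Return the top_k non-seed words closest to the centroid of the seeds."""
    dims = len(embedding[seeds[0]])
    centroid = [
        sum(embedding[s][i] for s in seeds) / len(seeds) for i in range(dims)
    ]
    candidates = [w for w in embedding if w not in seeds]
    ranked = sorted(
        candidates, key=lambda w: cosine(embedding[w], centroid), reverse=True
    )
    return ranked[:top_k]


def human_filter(candidates, approved):
    """Stand-in for crowd validation: keep only words the raters approved."""
    return [w for w in candidates if w in approved]


violence_candidates = expand_category(["bleed", "punch"], TOY_EMBEDDING, top_k=2)
violence = human_filter(violence_candidates, approved={"stab", "wound"})
```

With these toy vectors, the seeds "bleed" and "punch" pull in "stab" and "wound" while leaving "senate" and "tweet" out. In a real system the embedding would cover a large vocabulary and the approval set would come from human raters rather than a hard-coded set.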