Sketch the Storyline with CHARCOAL: A Non-Parametric Approach / 3841
Siliang Tang, Fei Wu, Si Li, Weiming Lu, Zhongfei Zhang, Yueting Zhuang
Generating a coherent synopsis and revealing the development threads of news stories from the ever-growing volume of news content remains a formidable challenge. In this paper, we propose an hddCRP (hybrid distance-dependent Chinese Restaurant Process) based HierARChical tOpic model for news Article cLustering, abbreviated as CHARCOAL. Given a collection of news articles, the outcome of CHARCOAL is threefold: 1) it aggregates relevant news articles into clusters (i.e., stories); 2) it disentangles the chain links (i.e., the storyline) between the articles that describe each story; 3) it discerns the topic to which each story is assigned (e.g., the Malaysia Airlines Flight 370 story belongs to the aircraft accident topic and U.S. presidential election stories belong to the politics topic). CHARCOAL accomplishes this by using an hddCRP as the prior and the entities (e.g., names of persons, organizations, or locations) that appear in news articles as clues. Moreover, the nonparametric nature of CHARCOAL allows the model to adaptively learn the appropriate number of stories and topics from the news corpus. The experimental analysis and results demonstrate both the interpretability and the superiority of the proposed approach.
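To make the role of the prior concrete, the following is a minimal sketch of a plain sequential distance-dependent CRP over article timestamps, not the hybrid hddCRP of CHARCOAL itself. It assumes an exponential decay over time differences; the function names, the `alpha` and `decay` parameters, and the use of timestamps as the only distance are illustrative assumptions, and the entity clues and the topic level are omitted. Each article links either to one earlier article or to itself; self-links open new stories, so the number of stories is not fixed in advance.

```python
import math
import random


def sample_ddcrp_links(timestamps, alpha=1.0, decay=1.0, seed=0):
    """Draw one link per article from a sequential ddCRP prior.

    Assumes `timestamps` is sorted in ascending order. Article i links to an
    earlier article j with weight exp(-(t_i - t_j) / decay), or to itself
    with weight alpha (hypothetical parameter names).
    """
    rng = random.Random(seed)
    links = []
    for i, t_i in enumerate(timestamps):
        # Candidate targets: every earlier article j < i ...
        weights = [math.exp(-(t_i - timestamps[j]) / decay) for j in range(i)]
        # ... plus a self-link (index i), which starts a new story.
        weights.append(alpha)
        links.append(rng.choices(range(i + 1), weights=weights, k=1)[0])
    return links


def stories_from_links(links):
    """Group articles into stories by following links back to a self-link."""
    def root(i):
        # Links only point backwards or to self, so this loop terminates.
        while links[i] != i:
            i = links[i]
        return i

    stories = {}
    for i in range(len(links)):
        stories.setdefault(root(i), []).append(i)
    return list(stories.values())


if __name__ == "__main__":
    days = [0, 1, 2, 10, 11]  # toy publication days, illustrative only
    links = sample_ddcrp_links(days, alpha=1.0, decay=2.0)
    print(links)
    print(stories_from_links(links))
```

In this sketch the connected components of the link structure play the role of stories, and the links inside a component form the chain that can be read as a storyline. A full model would combine such a prior with a likelihood over article content and resample the links by posterior inference.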