Learning Question Paraphrases for QA from Encarta Logs

Shiqi Zhao, Ming Zhou, Ting Liu

Question paraphrasing is critical in many Natural Language Processing (NLP) applications, especially for question reformulation in question answering (QA). However, choosing an appropriate data source and developing effective methods are challenging tasks. In this paper, we propose a method that exploits Encarta logs to automatically identify question paraphrases and extract templates. Questions from Encarta logs are partitioned into small clusters, within which a perceptron classifier is used to identify question paraphrases. Experimental results show that: (1) Encarta log data is a suitable data source for question paraphrasing, and the user clicks in the data are indicative clues for recognizing paraphrases; (2) the supervised method we present is effective and clearly outperforms the unsupervised method, and the features introduced to identify paraphrases are sound; (3) the obtained question paraphrase templates are quite effective in question reformulation, raising the MRR from 0.2761 to 0.4939 on the questions of TREC QA 2003.
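
For readers unfamiliar with the metric, MRR (mean reciprocal rank) averages, over all test questions, the reciprocal of the rank at which the first correct answer appears; a question with no correct answer returned contributes zero. The following is a minimal sketch of that standard definition, not the evaluation code used in this work. The function name and the sample ranks are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_correct_ranks: Sequence[Optional[int]]) -> float:
    """Average 1/rank over all questions.

    Each entry is the 1-based rank of the first correct answer for one
    question, or None when no correct answer was returned (scores 0).
    """
    if not first_correct_ranks:
        raise ValueError("need at least one question")
    total = sum(1.0 / rank for rank in first_correct_ranks if rank is not None)
    return total / len(first_correct_ranks)


# Hypothetical ranks for four questions (illustrative only, not paper data):
# correct at rank 1, at rank 2, never found, and at rank 4.
sample_ranks = [1, 2, None, 4]
print(mean_reciprocal_rank(sample_ranks))
```

Under this definition, a reformulation step raises MRR when it moves the first correct answer closer to the top of the ranked list, or when it recovers a correct answer for a question that previously had none.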