Exploiting Background Knowledge to Build Reference Sets for Information Extraction

Previous work on information extraction from unstructured, ungrammatical text (e.g. classified ads) showed that exploiting a set of background knowledge, called a "reference set," greatly improves the precision and recall of the extractions. However, finding a source for this reference set is often difficult, if not impossible. Further, even if a source is found, it might not overlap well with the text for extraction. In this paper we present an approach to building the reference set directly from the text itself. Our approach eliminates the need to find the source for the reference set, and ensures better overlap between the text and reference set. Starting with a small amount of background knowledge, our technique constructs tuples representing the entities in the text to form a reference set. Our results show that our method outperforms manually constructed reference sets, since hand built reference sets may not overlap with the entities in the unstructured, ungrammatical text. We also ran experiments comparing our method to the supervised approach of Conditional Random Fields (CRFs) using simple, generic features. These results show our method achieves an improvement in F1-measure for 6/9 attributes and is competitive in performance on the others, and this is without training data.

Matthew Michelson, Craig A. Knoblock