Learning Semantic Descriptions of Web Information Sources

Mark J Carman, Craig A. Knoblock

The Internet is full of information sources providing various types of data from weather forecasts to travel deals. These sources can be accessed via web-forms, Web Services or RSS feeds. In order to make automated use of these sources, one needs to first model them semantically. Writing semantic descriptions for web sources is both tedious and error prone. In this paper we investigate the problem of automatically generating such models. We introduce a framework for learning Datalog definitions for web sources, in which we actively invoke sources and compare the data they produce with that of known sources of information. We perform an inductive search through the space of plausible source definitions in order to learn the best possible semantic model for each new source. The paper includes an empirical evaluation demonstrating the effectiveness of our approach on real-world web sources.