Synthesizing Union Tables from the Web
Xiao Ling, Alon Halevy, Fei Wu, Cong Yu
Several recent works have focused on harvesting HTML tables from the Web and recovering their semantics. As a result, users can now explore hundreds of millions of high-quality structured data tables. In this paper, we argue that those efforts only scratch the surface of the true value of structured data on the Web, and we study the challenging problem of synthesizing tables from the Web, i.e., producing never-before-seen tables from raw tables on the Web. Table synthesis offers an important semantic advantage: when a set of related tables is combined into a single union table, powerful mechanisms, such as temporal or geographical comparison and visualization, can be employed to understand and mine the underlying data holistically. We focus on one fundamental task of table synthesis, namely, table stitching. Within a given site, tables with identical schemas can be scattered across many pages. The task of table stitching involves combining such tables into a single meaningful union table and identifying extra attributes and values for its rows so that rows from different original tables can be distinguished. Specifically, we first define the notion of stitchable tables and identify collections of tables that can be stitched. Second, we design an effective algorithm for extracting hidden attributes that are essential for the stitching process and for aligning the values of those attributes across tables to synthesize new columns. Third, we assign meaningful names to these synthesized columns. Experiments on real-world tables demonstrate the effectiveness of our approach.
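To make the stitching task concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes that each raw table already carries a single hidden attribute value (for example, a year recovered from the page title), whereas the paper extracts and aligns such values automatically. The function name `stitch_tables`, the dictionary keys `header`, `rows`, and `context`, and the column name `source_context` are all hypothetical, and "identical schema" is approximated here by exact equality of header rows.

```python
from collections import defaultdict


def stitch_tables(tables, hidden_name="source_context"):
    """Union tables that share a header and add one synthesized column.

    Each table is a dict with hypothetical keys:
      'header'  - list of column names
      'rows'    - list of rows (each a list of cell values)
      'context' - an already-extracted hidden attribute value (assumption)
    """
    # Group tables whose header rows are exactly equal.
    groups = defaultdict(list)
    for table in tables:
        groups[tuple(table["header"])].append(table)

    unions = []
    for header, members in groups.items():
        if len(members) < 2:
            continue  # nothing to stitch for a schema seen only once
        rows = []
        for member in members:
            for row in member["rows"]:
                # Append the hidden value so rows from different
                # original tables remain distinguishable.
                rows.append(list(row) + [member["context"]])
        unions.append({"header": list(header) + [hidden_name], "rows": rows})
    return unions


if __name__ == "__main__":
    # Illustrative, made-up data.
    raw = [
        {"header": ["City", "Population"], "rows": [["Springfield", "100"]], "context": "2010"},
        {"header": ["City", "Population"], "rows": [["Springfield", "120"]], "context": "2011"},
        {"header": ["Team", "Wins"], "rows": [["A", "3"]], "context": "2011"},
    ]
    for union in stitch_tables(raw):
        print(union["header"])
        for row in union["rows"]:
            print(row)
```

In this sketch the two `City`/`Population` tables are merged and each row gains the value of its originating context, while the lone `Team`/`Wins` table is left out because it has no partner to be stitched with.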