Divide Rows and Conquer Cells: Towards Structure Recognition for Large Tables

Huawen Shen; Xiang Gao; Jin Wei; Liang Qiao; Yu Zhou; Qiang Li; Zhanzhan Cheng

doi:10.24963/ijcai.2023/152

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence

Main Track. Pages 1369-1377. https://doi.org/10.24963/ijcai.2023/152

Recent advanced Table Structure Recognition (TSR) models adopt image-to-text solutions to parse table structure. These methods can be formulated as an image captioning problem: the input is a single-table image, and the output is a description of the table structure in a specific text format such as HTML. Following the impressive success of the Transformer in text generation tasks, these methods use a Transformer architecture to predict the HTML table text autoregressively. However, tables come in a wide variety of shapes and sizes, and autoregressive models usually suffer from error accumulation as the predicted text grows longer, which results in unsatisfactory performance on large tables. In this paper, we propose a novel image-to-text based TSR method that relieves the error accumulation problem and improves performance noticeably. At the core of our method is a cascaded two-step decoder architecture: the first decoder predicts HTML table row tags non-autoregressively, and the second predicts the HTML table cell tags of each row in a semi-autoregressive manner. Compared with existing methods that predict HTML text autoregressively, our row-to-cell progressive table parsing has two advantages: (1) it generates the HTML tag sequence with a vertical-and-horizontal two-step "scanning", which better fits the inherent 2D structure of image data; (2) it performs substantially better on large tables (long sequence prediction), since it alleviates the error accumulation problem specific to autoregressive models. Extensive experiments demonstrate that our method achieves competitive performance on three public benchmarks.
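
To make the sequence-length argument concrete, the following is a minimal sketch, not the authors' implementation. It assumes one plausible reading of the row-to-cell scheme: rows are handled in parallel, and only the cell tags within a single row depend on one another. The helper names `flat_sequence` and `longest_dependency_chain` and the 50 x 8 example table are hypothetical, introduced only for illustration.

```python
from typing import List


def flat_sequence(rows: List[List[str]]) -> List[str]:
    """Tokens a fully autoregressive decoder would emit one after another."""
    tokens: List[str] = []
    for cells in rows:
        tokens.append("<tr>")
        tokens.extend(cells)
        tokens.append("</tr>")
    return tokens


def longest_dependency_chain(rows: List[List[str]]) -> int:
    """Longest run of tokens that depend on each other when rows are decoded
    independently (assumption: only cells inside one row are sequential)."""
    return max((len(cells) for cells in rows), default=0)


# Hypothetical large table: 50 rows with 8 cells each.
rows = [["<td></td>"] * 8 for _ in range(50)]

print(len(flat_sequence(rows)))        # 500 tokens in a single chain
print(longest_dependency_chain(rows))  # 8 tokens in the longest chain
```

Under these assumptions, an early mistake in the flat sequence can influence up to 500 later predictions, whereas in the row-then-cell layout it can influence at most the remaining cells of its own row.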

Keywords:

Computer Vision: CV: Recognition (object detection, categorization)

Computer Vision: CV: Applications