Towards Accurate Video Text Spotting with Text-wise Semantic Reasoning

Xinyan Zu, Haiyang Yu, Bin Li, Xiangyang Xue

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence
Main Track. Pages 1858-1866. https://doi.org/10.24963/ijcai.2023/206

Video text spotting (VTS) aims at extracting texts from videos, where text detection, tracking, and recognition are conducted simultaneously. Several existing methods can tackle VTS; however, they often ignore the underlying semantic relationships among texts within a frame. We observe that the texts within a frame usually share similar semantics, which suggests that, if one text is predicted incorrectly by a text recognizer, it still has a chance to be corrected via semantic reasoning. In this paper, we propose an accurate video text spotter, VLSpotter, that reads texts visually, linguistically, and semantically. For 'visually', we propose a plug-and-play text-focused super-resolution module to alleviate motion blur and enhance video quality. For 'linguistically', a language model is employed to capture intra-text context and mitigate misspelled text predictions. For 'semantically', we propose a text-wise semantic reasoning module that models inter-text semantic relationships and reasons toward better predictions. Experimental results on multiple VTS benchmarks demonstrate that the proposed VLSpotter outperforms existing state-of-the-art methods in end-to-end video text spotting.
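To make the inter-text idea concrete, the sketch below shows one plausible realization of a text-wise semantic reasoning module: the per-instance text embeddings of a frame attend to one another so that frame-level context can refine each instance's character predictions. This is a minimal illustration assuming a standard self-attention design; the class name, dimensions, and pooling scheme are hypothetical and are not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TextWiseSemanticReasoning(nn.Module):
    """Hypothetical sketch: each text instance in a frame attends to the
    others, so a misrecognized text can be corrected from shared context."""

    def __init__(self, dim: int = 256, num_heads: int = 8, vocab_size: int = 97):
        super().__init__()
        # Self-attention across the text instances of one frame.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Re-predict characters from the context-enriched embeddings.
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, text_embs: torch.Tensor) -> torch.Tensor:
        # text_embs: (num_texts, max_len, dim), one embedding sequence
        # per detected text instance in the frame.
        # Pool each instance to a single vector, then attend across instances.
        pooled = text_embs.mean(dim=1).unsqueeze(0)        # (1, num_texts, dim)
        context, _ = self.attn(pooled, pooled, pooled)     # inter-text attention
        context = self.norm(pooled + context)              # (1, num_texts, dim)
        # Broadcast the frame-level context back to every character position.
        enriched = text_embs + context.squeeze(0).unsqueeze(1)
        return self.classifier(enriched)                   # (num_texts, max_len, vocab)

# Example: five text instances from one frame, up to 25 characters each.
module = TextWiseSemanticReasoning()
frame_texts = torch.randn(5, 25, 256)
logits = module(frame_texts)  # refined character logits per text instance
```

In this reading, the module complements the intra-text language model: the language model corrects spellings within a single text, while the attention step above lets semantically related texts in the same frame inform each other.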
Keywords:
Computer Vision: CV: Vision and language 
Computer Vision: CV: Video analysis and understanding