Dense Temporal Convolution Network for Sign Language Translation

Dan Guo, Shuo Wang, Qi Tian, Meng Wang

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence
Main track. Pages 744-750. https://doi.org/10.24963/ijcai.2019/105

Sign language translation (SLT), which aims to translate a sign language video into natural language, is a weakly supervised task, since the sentence label provides no exact mapping between visual actions and textual words. To align the sign language actions and translate them into the corresponding words automatically, this paper proposes a dense temporal convolution network, termed DenseTCN, which captures the actions in hierarchical views. Within this network, a temporal convolution (TC) is designed to learn the short-term correlation among adjacent features and is further extended to a dense hierarchical structure. In the kth TC layer, we integrate the outputs of all preceding layers: (1) the TC in a deeper layer has a larger receptive field, which captures long-term temporal context through the hierarchical content transition; (2) the integration addresses the SLT problem from different views, including embedded short-term and extended long-term sequential learning. Finally, we adopt the CTC loss and a fusion strategy to learn the feature-wise classification and generate the translated sentence. Experimental results on two popular sign language benchmarks, PHOENIX and USTC-ConSents, demonstrate the effectiveness of the proposed method under various evaluation measures.
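The dense connectivity described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`temporal_conv`, `dense_tcn`, `make_layers`, `receptive_field`, `ctc_greedy_collapse`), the ReLU activation, the zero padding, the random weights, and the stride-1, undilated convolution are all assumptions made for illustration. The abstract does not specify these details, nor the fusion strategy, which is omitted here. The sketch only shows how each TC layer can consume the concatenated outputs of all preceding layers, how the receptive field grows with depth under these assumptions, and how a frame-wise label sequence is collapsed in the usual CTC manner.

```python
import random


def temporal_conv(seq, weights, bias):
    """1-D convolution over time with zero padding, followed by ReLU.

    seq:     list of T frames, each a list of C_in floats
    weights: nested list indexed as [kernel_pos][out_channel][in_channel]
    bias:    list of C_out floats
    """
    k = len(weights)
    pad = k // 2
    zero = [0.0] * len(seq[0])
    padded = [zero] * pad + seq + [zero] * pad
    out = []
    for t in range(len(seq)):
        frame = []
        for o in range(len(bias)):
            acc = bias[o]
            for j in range(k):
                acc += sum(w * x for w, x in zip(weights[j][o], padded[t + j]))
            frame.append(max(0.0, acc))  # ReLU (assumed activation)
        out.append(frame)
    return out


def concat_channels(feature_maps, t_len):
    """Concatenate several feature maps along the channel axis, frame by frame."""
    return [sum((f[t] for f in feature_maps), []) for t in range(t_len)]


def dense_tcn(seq, layers):
    """Dense stack: layer k sees the input plus the outputs of layers 1..k-1."""
    features = [seq]
    for weights, bias in layers:
        dense_input = concat_channels(features, len(seq))
        features.append(temporal_conv(dense_input, weights, bias))
    return concat_channels(features, len(seq))


def make_layers(c_in, growth, num_layers, kernel, seed=0):
    """Random weights for illustration; layer i takes c_in + i * growth channels."""
    rng = random.Random(seed)
    layers = []
    for i in range(num_layers):
        channels = c_in + i * growth
        weights = [[[rng.uniform(-1.0, 1.0) for _ in range(channels)]
                    for _ in range(growth)] for _ in range(kernel)]
        layers.append((weights, [0.0] * growth))
    return layers


def receptive_field(num_layers, kernel):
    """Frames seen by the deepest layer, assuming stride 1 and no dilation."""
    return 1 + num_layers * (kernel - 1)


def ctc_greedy_collapse(labels, blank=0):
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    collapsed = []
    previous = None
    for label in labels:
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return collapsed
```

With a hypothetical input of 6 frames and 4 channels, `dense_tcn(seq, make_layers(4, 3, 3, 3))` returns 6 frames of 4 + 3 × 3 = 13 channels, because every layer's output is kept alongside the input. Under the stride-1 assumption, three layers with kernel size 3 cover 7 frames, which is one way to read the claim that deeper TC layers capture longer temporal context.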
Keywords:
Computer Vision: Action Recognition
Computer Vision: Biometrics, Face and Gesture Recognition