Learning to Detect Concepts from Webly-Labeled Video Data
Junwei Liang, Lu Jiang, Deyu Meng, Alexander Hauptmann
Learning detectors that can recognize concepts such as people, actions, and objects in video content is an interesting but challenging problem. In this paper, we study the problem of automatically learning detectors from large-scale video data on the web without any additional manual annotations. The contextual information available on the web provides noisy labels for the video content. To leverage these noisy web labels, we propose a novel method called WEbly-Labeled Learning (WELL). It builds on two theories, curriculum learning and self-paced learning, and exhibits useful properties that can be theoretically verified. We provide insights into the latent non-convex robust loss that is minimized on the noisy data. In addition, we propose two novel techniques that not only enable WELL to be applied to big data but also lead to more accurate results. The efficacy and scalability of WELL have been extensively demonstrated on two public benchmarks, including the largest multimedia dataset and the largest manually-labeled video set. Experimental results show that WELL significantly outperforms state-of-the-art methods. To the best of our knowledge, WELL achieves by far the best reported performance on these two webly-labeled big video datasets.