Video Understanding is a New Vista for AI

AI-powered machines are now able to recognize images more accurately than human beings. This has prompted tech companies to explore the new field of video understanding, in which machines are trained to answer questions like “Who is in this video?” or “What are the cats in the video doing?”

In a session on video understanding at last week’s AI Frontiers Conference, Google Principal Scientist Rahul Sukthankar, Facebook Manager of Computer Vision Manohar Paluri, and Alibaba iDST (Institute of Data Science and Technologies) Chief Scientist Xiaofeng Ren all agreed that video understanding has enormous potential when powered by AI.

Deep learning has certainly delivered better results than previous methods in video understanding research, said Sukthankar. Five years ago, multiple steps were required between input and output of training models, including manually designed descriptors and codebook histograms; now, deep learning offers end-to-end solutions by directly feeding data into the model. Deep-learned models can effect an 80% improvement in mean average precision over models using hand-crafted features.
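For readers unfamiliar with the metric, here is a minimal Python sketch of how mean average precision (mAP) is commonly computed for a ranking task such as video annotation. It is not taken from the talk: the function names and all numbers are hypothetical, and the article does not say whether the 80% figure is a relative or an absolute gain. The helper below assumes a relative one.

```python
def average_precision(scores, labels):
    """AP for one class: mean of the precision measured at each positive item,
    after ranking all items by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positive examples contributes 0.0 in this sketch.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class):
    """mAP: the unweighted mean of per-class AP values.
    per_class is a list of (scores, labels) pairs, one per class."""
    return sum(average_precision(s, l) for s, l in per_class) / len(per_class)


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline` (assumed interpretation)."""
    return (improved - baseline) / baseline


# Made-up scores and ground-truth labels for two classes (illustration only).
toy_classes = [
    ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]),
    ([0.9, 0.2, 0.5], [0, 1, 1]),
]
toy_map = mean_average_precision(toy_classes)

# Hypothetical mAP values: going from 0.25 to 0.45 is an 80% relative gain.
toy_gain = relative_improvement(0.25, 0.45)
```

Under the relative reading, an 80% improvement means the new mAP is 1.8 times the baseline, not 80 percentage points higher.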

Deep learning is already being used to optimize YouTube services such as large-scale video annotation and automated thumbnail selection.

Sukthankar said Google is planning to use video understanding to train robots to learn human movements from videos. At the conference Google introduced its time-contrastive network, a neural network that simulates actions in a video and learns basic movements such as standing or bending.

The research described above depends on large-scale open-source video datasets. Sukthankar said different video datasets suit different areas of video understanding research. For example, Sports-1M and YouTube-8M are designed for video annotation; THUMOS, Kinetics, and Google’s recently released AVA dataset are used for action recognition; and YouTube-BB and Open Images can be used to train models for object recognition.

Paluri from Facebook introduced the company’s newly released open-source visual data platform, dubbed Lumos. Based on FBLearner Flow, Lumos is a platform for image and video understanding. Facebook engineers need not be trained in deep learning or computer vision to train and deploy a new model using Lumos.

Paluri also announced the exciting news that Facebook will release two new datasets early next year: Scenes Objects & Actions (SOA) and Generic Motions.

Ren from Alibaba discussed application scenarios for video understanding, focusing on how to apply it to Alibaba’s e-commerce business. For example, Alibaba is able to recognize objects in video content and link them to shopping pages on Taobao (an Amazon-like platform). This year, Alibaba began allowing Taobao sellers to upload promotional videos. Taobao can then analyze the video content to improve product search.

It’s not just the tech giants that are achieving significant improvements in video understanding. Berlin-Montreal AI startup Twenty Billion Neurons GmbH (TwentyBN) introduced an AI system called Super Model that can observe actions in the real physical world and output a live caption of what it sees. Last year TwentyBN announced a funding round of US$2.5 million, and invited Dr. Yoshua Bengio to become an advisor.

Thanks to AI, machines are rapidly developing a clear and accurate perception of humans’ dynamic physical environments. And it would seem the more they can understand humans, the more humanlike they can become.

Journalist: Tony Peng | Editor: Michael Sarazen
