CVPR 2017: The Fusion of Deep Learning and Computer Vision, What’s Next?

The 2017 Conference on Computer Vision and Pattern Recognition (CVPR) was held from July 21st to July 26th in Honolulu, Hawaii. This year’s conference accepted 783 papers out of 2,620 valid submissions, with 215 of them given as long or short presentations across 3 parallel tracks. The conference attracted 127 sponsors with $859,000 in sponsorship funds, and close to 5,000 people attended, a significant increase from just over a thousand a few years ago.

In this article, we recap the conference and walk you through this carnival (the coffee was great!). Let’s get started.

Main Conference

CVPR is one of the most influential computer vision conferences. During its 4-day main conference (July 22nd – July 25th), it covered the following main topics:

Machine Learning
Object Recognition & Scene Understanding – Computer Vision & Language
3D Vision
Analyzing Humans
Low- & Mid- Level Vision
Image Motion & Tracking: Video Analysis
Computational Photography
Applications

The first four topics accounted for more than 80% of the accepted papers. We will first introduce these four topics.

Machine Learning

In the Machine Learning session, most long and short presentations focused on pushing past the performance limits of existing models, while a few excellent papers dove deep into understanding the mechanisms of Neural Networks:

Densely Connected Convolutional Networks. This paper is one of the recipients of the Best Paper Award. It introduces DenseNet, a network in which each layer receives the feature maps of all preceding layers as input, so connectivity stays dense as the network goes deeper. Its advantages over a vanilla CNN include stronger gradient flow and substantially better computational efficiency. “They partially answered the question on how NN works, and they conducted their research by exploring the unknowns rather than only tweaking the NN architecture,” a researcher commented.
Global Optimality in Neural Network Training. This paper shows that a global optimum can be reached as long as both the network output and the regularization are positively homogeneous functions of the network parameters. In short, ReLU is positively homogeneous because max(0, ax) = a * max(0, x) for any a ≥ 0, whereas softmax is not. What’s more, it extends the theory to multiple AlexNet networks connected in parallel. This interesting paper can serve as a guide for designing and training Neural Network models.
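The homogeneity property mentioned above can be checked numerically. The following is a minimal sketch, not code from the paper: the helper name is_positively_homogeneous and the sample inputs are illustrative assumptions, and a check on a handful of inputs is only a sanity test, not a proof.

```python
import math

def relu(xs):
    """Element-wise ReLU: max(0, x)."""
    return [max(0.0, x) for x in xs]

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def is_positively_homogeneous(f, xs, a, degree=1, tol=1e-9):
    """Hypothetical helper: test f(a * x) == a**degree * f(x) for one scale a >= 0."""
    scaled_input = f([a * x for x in xs])
    scaled_output = [(a ** degree) * y for y in f(xs)]
    return all(abs(p - q) <= tol for p, q in zip(scaled_input, scaled_output))

x = [-2.0, -0.5, 0.0, 1.5, 3.0]

# ReLU passes for every non-negative scale we try.
relu_ok = all(is_positively_homogeneous(relu, x, a) for a in (0.0, 0.5, 2.0, 10.0))

# Softmax fails: its outputs always sum to 1, so they cannot scale with the input.
softmax_ok = is_positively_homogeneous(softmax, x, 2.0)

print(relu_ok, softmax_ok)
```

The softmax case fails for a simple reason: doubling the input would have to double the output, but a softmax output always sums to 1.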

Another noteworthy work was also presented:

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. This work, the Super-Resolution Generative Adversarial Network (SRGAN), replaces the pixel-wise mean squared error objective with a perceptual loss. The perceptual loss consists of a content loss and an adversarial loss, which leads the GAN to represent high-level content rather than match individual pixels. Furthermore, a new metric for measuring perceptual quality is proposed.
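To make the structure of such a perceptual loss concrete, here is a minimal sketch under stated assumptions. The function names, the toy feature vectors, and the weighting factor adv_weight are illustrative choices, not values taken from the article; in practice the features would come from a pretrained network and the probabilities from a discriminator.

```python
import math

def content_loss(features_generated, features_target):
    """Mean squared error between two feature vectors.
    Assumption: both vectors come from the same layer of some feature extractor."""
    n = len(features_generated)
    return sum((a - b) ** 2 for a, b in zip(features_generated, features_target)) / n

def adversarial_loss(disc_probs_on_generated, eps=1e-12):
    """Generator-side adversarial term: mean of -log D(G(x)).
    It is small when the discriminator believes the generated images are real."""
    n = len(disc_probs_on_generated)
    return sum(-math.log(p + eps) for p in disc_probs_on_generated) / n

def perceptual_loss(features_generated, features_target, disc_probs, adv_weight=1e-3):
    """Content loss plus a weighted adversarial loss (adv_weight is an assumed value)."""
    return (content_loss(features_generated, features_target)
            + adv_weight * adversarial_loss(disc_probs))

# Toy inputs standing in for real feature maps and discriminator outputs.
feats_sr = [0.2, 0.8, 0.5]
feats_hr = [0.0, 1.0, 0.5]
probs = [0.5, 0.5]

content = content_loss(feats_sr, feats_hr)
adv = adversarial_loss(probs)
total = perceptual_loss(feats_sr, feats_hr, probs)
print(content, adv, total)
```

Comparing features instead of raw pixels is what lets the loss reward images that look right even when individual pixel values differ from the target.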

As Machine Learning, especially its sub-domain Deep Learning, has become incredibly effective at computer vision problems, it is no surprise that ML/DL is the mainstream this year (since AlexNet shocked the world in 2012, ML/DL has left no room for traditional computer vision methods). “Machine Learning dominates computer vision! It’s exciting and there will be more opportunities!” one interviewee said with wild enthusiasm during a coffee break. However, some were worried because DL is mostly being used as a tool: “Very few researchers pay attention to pushing ML/DL theories forward, this is not good.” Whether you accept this trend or not, ML and DL are dominating everywhere.

Although Deep Learning has become quite popular, implementing a single Deep Learning model is not enough. Across the papers related to Deep Learning, the concepts and methods of Machine Learning cannot be ignored: with their help, Deep Learning’s performance has been extended and its behavior is more easily explained. The fusion of Machine Learning metrics and deep networks has also become very popular, for the simple reason that it can produce better models.

3D Vision

Topics in 3D Vision include reconstruction, segmentation, etc. Compared to 2D image processing, the extra dimension introduces more uncertainties, such as occlusions and different camera angles. Researchers have put much effort into dealing with these situations, and the two presentations that received the biggest rounds of applause were both given by groups from Princeton University:

Semantic Scene Completion from a Single Depth Image. The main objective of this work is to reconstruct the objects in a scene from a single depth image. However, the inherent ambiguity of 3D scenes lowers reconstruction accuracy. To deal with this problem, the paper proposes a data-driven method: a Neural Network trained on a large dataset serves as the knowledge base. This prior knowledge alleviates the pain of occlusions: the model can infer an object by recognizing the objects around it, which greatly increases accuracy.
3DMatch: Learning Local Geometric Descriptors From RGB-D Reconstructions. This paper also uses a data-driven model as a prior knowledge base. To deal with insufficient training data, the authors employ self-supervised learning to generate it, namely by capturing long-range correspondences between views taken from different angles.

As mentioned, the extra dimension challenges researchers with noisy, low-resolution, and incomplete scan data. Current works are beginning to capture global semantic meaning and match it with local geometric patterns. However, the size of current datasets may no longer support cutting-edge research. Thus, the next research goal may shift to developing a properly designed dataset for 3D Vision. Other papers, such as Noise Robust Depth From Focus Using a Ring Difference Filter, Learning From Noisy Large-Scale Datasets With Minimal Supervision, Global Hypothesis Generation for 6D Object Pose Estimation, and Multi-Scale Continuous CRFs as Sequential Deep Networks for Monocular Depth Estimation, deal with noisy data and estimation problems. “I’m interested in Geometric Deep Learning, it will be the new trend,” one Ph.D. student said.

Object Recognition & Scene Understanding

Object recognition is another major topic this year. In the past, researchers have done lots of work on recognizing single objects and understanding the scene as a whole. However, research goals have now shifted to figuring out the relationships between multiple objects in a single image. Take Detecting Visual Relationships with Deep Relational Networks, for example: this work proposes an integrated framework not only for classifying single objects, but also for exploring the visual relationships between different objects.

Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition. This paper addresses two challenges in fine-grained image recognition: discriminative region localization and fine-grained feature learning. To deal with them, the authors introduce the Recurrent Attention Convolutional Neural Network, which uses an attention mechanism to iteratively take closer looks at the target object and discriminate between tiny differences.
Annotating Object Instances With a Polygon-RNN. This paper received an honorable mention for the Best Paper Award. It creatively formulates object annotation as polygon prediction rather than traditional pixel labeling. Obtaining data quickly is critical when data size becomes the bottleneck of Deep Learning, and this work supplies researchers with a flexible annotation method.

We also found an interesting work during the poster session:

Automatic Understanding of Image and Video Advertisements. Advertisements implicitly persuade customers to take certain actions. Understanding ads requires more than recognizing physical content. The annotations in this work cover 38 topics and 30 sentiments, with symbolism linking physical content to abstract concepts.

Analyzing Humans

Due to increasing threats to public safety, the need for person identification and pedestrian detection is growing quickly. Thankfully, many applications and extended theories in this field continue to emerge.

These two papers drew big rounds of applause during their presentations:

Person Re-Identification in the Wild. Previous work focuses on person re-ID only, while this work combines person detection and person re-ID. The authors propose ID-discriminative Embedding (IDE), which is easy to train and test. Their insights on how detection helps person re-ID include:
- Evaluating detector performance under re-ID applications;
- A cascade IDE fine-tuning strategy: first fine-tune detection, then fine-tune re-ID.
Recurrent 3D Pose Sequence Machines. Due to large variations in human appearance, arbitrary camera viewpoints, obstructed visibility, and inherent ambiguity, 3D pose estimation is much more challenging than its 2D counterpart. This paper proposes a novel Recurrent 3D Pose Sequence Machine (RPSM) model that learns to recurrently integrate rich spatial and temporal long-range dependencies using multi-stage sequential refinement.

However, as cameras target people in their daily lives, privacy becomes another hot topic: “I saw lots of works emerging, and these can be a huge challenge to supervision departments. For tasks such as person identification, personal privacy may be at stake,” a concerned scholar said.

Research Trends & Observations

Machine Learning and Deep Learning in Computer Vision. There are also dissenting voices: a scholar from electrical engineering said, “I don’t think the combination of computer vision and deep learning is very good, although it produces so many successful applications and papers. Traditionally, from the perspective of signal processing, we know the physical meanings of computer vision, for example, Scale-Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF) methods, but deep learning leads such meaning nowhere, all you need is more data. This can be seen as a huge step forward, but may be also a backward move. We need to re-evaluate our methods from rule-based to data-driven.”
Data-Driven Model. Models will no longer be built on hand-designed patterns (this approach often captures one or two features of a specific dataset but performs badly on other datasets). The focus is instead on data-driven models, whose features are learned from the thousands of new images they are fed. Some images may correspond closely to each other (but from different angles), so the model can learn the similarities itself by measuring the correspondence (for localization problems). In short, massive data leads to better results. But a simple algorithm with massive amounts of data will not be the best approach in the future. A successful model is built from a strong enough algorithm trained on a large enough, high-quality dataset. What’s more, it needs to be applied to the proper scene.
Datasets
1. Problem: in 2D & 3D Vision, many new lines of research run into a lack of suitable (or sufficient) training data.
2. Methods:
  1. find new methods to generate or augment training data, some of which rely on weakly supervised or self-supervised learning
  2. post annotation tasks on Amazon Mechanical Turk, etc.
3. Conclusion: ImageNet has dominated Computer Vision since 2009, and most models are trained on ImageNet. Now, data has become the bottleneck of advanced algorithms, and it is inevitable that researchers will build a larger general-purpose dataset. Furthermore, the quality of data is also important, since low-quality data may severely lower performance even if the model is good enough. Fully supervised methods cannot meet the data requirements, and the community needs to find a new way out. For example, learning from unreliable data, weakly supervised methods, and active learning in the environment may be plausible directions for the next wave.
Weakly supervised methods. Nearly 30 papers discuss weakly supervised methods. This trend is closely related to the problem of insufficient data. The term ‘weakly supervised’ refers to training on images with incomplete or imprecise labels. The labels are not tightly bound to the objects in the image (for example, a bounding box around a car with the label on the side), yet the images are fed into models without further processing. This trend reflects how hard human-labelled datasets are to obtain.
Coupling data and model. This will be the future trend. The fundamental problem of current research is that there is no longer enough data for advanced algorithms or for models built for special applications. As a result, many researchers’ outputs consist not only of algorithms or architectures, but also of datasets or methods to amass data.

Tutorials, Workshops, and Challenges

Among all the workshops, more than 14 are challenge driven. Many company labs and university research groups participated in the challenges. Aside from publishing research papers, the challenges have become another venue for research groups and companies to demonstrate their research and engineering abilities.

Challenges at CVPR 2017

ActivityNet Large Scale Activity Recognition Challenge 2017

Beyond ImageNet Large Scale Visual Recognition Challenge

2nd NTIRE: New Trends in Image Restoration and Enhancement workshop and challenge on super-resolution

The Bright and Dark Sides of Computer Vision: Challenges and Opportunities for Privacy and Security

The DAVIS Challenge on Video Object Segmentation 2017

Visual Question Answering Challenge 2017

YouTube-8M Large-Scale Video Understanding Challenge

Visual Understanding of Humans in Crowd Scene and the 1st Look Into Person (LIP) Challenge

Joint workshop on Computer Vision in Vehicle Technology and Autonomous Driving Challenge

Faces “in-the-wild” Workshop-Challenge

Joint Workshop on Scene Understanding and LSUN Challenge

Traffic Surveillance Workshop and Challenge

PASCAL IN DETAIL Workshop Challenge

Challenge on Visual Understanding by Learning from Web Data

Joint Bridges to 3D Vision Workshop and NRSfM Challenge

…

ImageNet, initiated by Dr. Fei-Fei Li, has been one of the most renowned challenges in the field of computer vision. During CVPR 2017, Dr. Fei-Fei Li and Dr. Jia Deng gave a talk about what ImageNet has achieved over the past 8 years, and also announced that Kaggle will take over ImageNet.

Which challenge should one watch for evaluating computer vision algorithms at large scale after ImageNet? “WebVision is the most promising one,” AI researcher Weilin Huang said at this year’s WebVision workshop (Challenge on Visual Understanding by Learning from Web Data).

At the CVPR workshop, Malong was given the WebVision Award by Professor Fei-Fei Li on behalf of the competition’s sponsor, Google Research.

The main differences between WebVision and ImageNet can be divided into two parts:

Unbalanced class distribution: the class distribution of WebVision depends on the queries, which means common objects are more likely to show up, while the class distribution of ImageNet is comparatively even.
Noisy data: all WebVision images come from queries to Google and Flickr, while ImageNet images are all human-labelled. Incomplete and wrong labels may therefore hinder training on the WebVision dataset.

To deal with the two problems above, Malong adopted a paradigm that is not new yet has rarely been used, called Curriculum Learning, which was first proposed by Bengio et al. at ICML 2009. Malong believes that Curriculum Learning, which trains CNNs on samples of increasing complexity, can tremendously boost model performance. Since noisy data can be filtered and fed into the network in a specific order (illustrated in the figure below), the model can then be fine-tuned. Their pipeline consists of a baseline model trained on meta-data, which is then trained on the curriculum-designed dataset.

The curriculum is designed by running K-means on each class. By doing this, noisy images with wrong labels are clustered together, and the other clusters, with correct or intermediate labels, are kept in descending order of relevance. With the curriculum designed this way, each cluster is inherently tied to a different level of complexity for the subsequent training process.

image (1).png

The models are then trained following the Curriculum Learning paradigm: clean data is fed first, then noisy data.
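The curriculum idea described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Malong’s implementation: the 2D features, the function names build_curriculum and training_stages, and the rule of ranking clusters by their spread (tight clusters treated as clean, loose ones as noisy) are all hypothetical choices, since the article does not say how clusters are ranked.

```python
import random

def dist(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(c) / n for c in zip(*points)]

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means; returns the cluster index assigned to each point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = mean(members)
    return assign

def build_curriculum(features, k=3, seed=0):
    """Cluster one class and order the clusters from tightest to loosest.
    Assumption: a tight cluster holds clean samples, a loose one holds noisy ones."""
    assign = kmeans(features, k, seed=seed)
    clusters = []
    for j in range(k):
        idx = [i for i, a in enumerate(assign) if a == j]
        if not idx:
            continue
        center = mean([features[i] for i in idx])
        spread = sum(dist(features[i], center) for i in idx) / len(idx)
        clusters.append((spread, idx))
    clusters.sort(key=lambda c: c[0])
    return [idx for _, idx in clusters]

def training_stages(curriculum):
    """Stage s trains on the cleanest subset plus every subset added so far."""
    stages, seen = [], []
    for subset in curriculum:
        seen = seen + subset
        stages.append(list(seen))
    return stages

# Toy class: 30 tightly grouped samples plus 10 scattered (mislabeled) ones.
rng = random.Random(42)
clean = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(30)]
noisy = [[rng.uniform(-5, 5), rng.uniform(-5, 5)] for _ in range(10)]
features = clean + noisy

curriculum = build_curriculum(features, k=2)
stages = training_stages(curriculum)
for s, indices in enumerate(stages):
    print("stage", s, "trains on", len(indices), "samples")
```

Each stage is a superset of the previous one, so the model sees the cleanest samples first and meets the noisier ones only after it has a reasonable starting point.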

image (2).png

In the end, Malong’s architecture with Curriculum Learning won the WebVision challenge with comparable results.

image (3).png

Besides challenges, most of the workshops also featured invited talks and panel discussions.

One of the most promising future industry directions at this year’s CVPR is autonomous driving. On the 1st day of workshops, there was a Joint workshop on Computer Vision in Vehicle Technology and Autonomous Driving Challenge. The morning half was the Computer Vision in Vehicle Technology workshop. Invited guest speakers talked about their vision and shared their experience in the field (http://adas.cvc.uab.es/cvvt2017/invited-talks/). NVIDIA sponsored the best paper award, which went to The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation, by Simon Jégou (MILA), Michal Drozdzal (Imagia), David Vazquez (Computer Vision Center), Adriana Romero (MILA), and Yoshua Bengio (MILA). The afternoon half was the ‘Autonomous Driving Challenge’, which consisted of 5 invited talks and a panel. Most of the talks were not purely academia driven, which made the workshop special.

We talked with Fisher Yu, a Ph.D. student from Stanford University. He is one of the organizers of the CVPR 2017 Workshop on Autonomous Driving Challenge. According to Fisher, the motivation for organizing such a workshop and challenge is to bridge the gap between industry and academia, so the invited talks were designed to cover both. Academia focuses more on how to eventually solve self-driving, while industry is more practical, devoting engineering effort to solving specific issues. The challenge workshop invited both high-profile CV researchers and start-up industry leaders: Professor Alan Yuille from Johns Hopkins University, who has made great contributions to solving vision problems; Andreas Geiger, the lead of KITTI and one of the early vision researchers to work on defining self-driving problems and setting up the corresponding datasets and benchmarks; and industry practitioners with strong academic backgrounds, such as Dr. Xiaodi Hou from TuSimple, Dr. Jianxiong Xiao from AutoX, James Peng from Pony AI, and Dr. Jan Becker from Apex.AI.

During his own presentation, Fisher Yu introduced the Berkeley DeepDrive project. The team has developed Berkeley Deep Drive Data (BDDD, https://deepdrive.berkeley.edu/), which provides hundreds of thousands of hours of driving data. Most of the data came from mobile-device sensors, such as camera, GPS, and IMU. BDDD features instance-level semantic segmentation and is well labeled. The BDDD team has developed an end-to-end driving policy, which was also presented at this year’s CVPR (https://arxiv.org/abs/1612.01079). They have also worked on improving efficiency with a smaller model, since the massive amount of data requires the model to be more efficient: smaller in size and faster at inference. Alan Yuille and Andreas Geiger are both interested in how to use simulated data for research, and in how to better study and analyze 3D data. According to Dr. Xiaodi Hou, CTO of the challenge host TuSimple, the challenge focuses on lane detection and velocity estimation to fill the gap left by missing benchmarks.

Industry Participation at Expo

This year, CVPR had a total of 127 sponsors. The number of sponsors usually drops when the conference is hosted in Hawaii. However, that was not the case this year, with 30% more sponsors than CVPR 2016. Most of the sponsors participated in the Industry Expo. When asked why they joined the Expo, most answered recruiting, while a few answered marketing.

Author: Chain Zhang, Qintong Wu | Editor: Hao Wang | Localized by Synced Global Team: Xiang Chen
