For an engineer, a practical way to build an autonomous vehicle is not to program a car to drive in every environment (which would involve a nearly infinite number of variables), but to show the car how to drive and let it learn by itself. NVIDIA created a system of this kind, named PilotNet. It is trained on road images paired with the steering angles produced by a human driver. Road tests show that this kind of neural network can perform very well.
However, there is one problem: we don’t know what PilotNet actually learned or how it makes its decisions. In this paper, the authors provide an approach to determine which elements of the road image most influence PilotNet’s steering decision.
The authors begin with a brief introduction to PilotNet, a learning system built around a trained convolutional neural network. The training data consists of individual images from a front-facing camera mounted on a data collection car, each paired with the human driver’s time-synchronized steering command. After training, the system reacts to new images by outputting steering commands. As described above, this raises an explainability problem: if researchers want to improve performance, they need to know how PilotNet makes decisions. Hence, the authors developed a way to highlight the parts of an image that are most salient in determining the steering angle. The method was then run on the NVIDIA DRIVE™ PX 2 AI car computer in their test car.
PilotNet Network Architecture:
The architecture of PilotNet is shown below in Figure 1.
It contains 9 layers: one normalization layer, 5 convolutional layers, and 3 fully-connected layers. The first layer performs image normalization; it is hard-coded and not adjusted during learning. The convolutional layers perform feature extraction. The first three convolutional layers are strided, with a 2×2 stride and a 5×5 kernel. The last two convolutional layers are non-strided, with a 3×3 kernel. The three fully-connected layers act as a controller for steering.
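To make the effect of the strides and kernel sizes concrete, here is a minimal sketch that computes how the feature maps shrink through the five convolutional layers. It assumes unpadded convolutions and a 66×200 input; both come from the PilotNet paper's architecture figure rather than from this review, and the function names are only illustrative.

```python
def conv_output_size(size, kernel, stride):
    """Output length along one axis for an unpadded convolution."""
    return (size - kernel) // stride + 1

# (kernel, stride) for the five convolutional layers described above:
# three strided 5x5 layers followed by two non-strided 3x3 layers.
CONV_LAYERS = [(5, 2), (5, 2), (5, 2), (3, 1), (3, 1)]

def feature_map_sizes(height, width, layers=CONV_LAYERS):
    """Return the (height, width) of the feature maps after each conv layer."""
    sizes = []
    for kernel, stride in layers:
        height = conv_output_size(height, kernel, stride)
        width = conv_output_size(width, kernel, stride)
        sizes.append((height, width))
    return sizes

# Assumed input size: 66 x 200 pixels (not stated in this review).
print(feature_map_sizes(66, 200))
# [(31, 98), (14, 47), (5, 22), (3, 20), (1, 18)]
```

Under these assumptions the last convolutional layer produces maps only one pixel high, which is why the saliency method has to scale the deeper maps back up before they can be compared with the input image.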
Finding the Salient Objects:
The authors describe a six-step process for finding the salient objects, in which the higher-level feature maps act as masks for the lower-level ones:
- In each layer, average the activations of the feature maps.
- Using deconvolution, scale the topmost averaged map up to the size of the map in the layer below.
- Multiply the up-scaled map by the averaged map of the layer below to get an intermediate mask.
- Scale the intermediate mask up to the size of the map in the next layer below, as in step 2.
- Multiply the up-scaled intermediate mask by the averaged map of that layer, as in step 3, to get a new intermediate mask.
- Repeat steps 4 and 5 until the input is reached, which gives the final visualization mask.
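The steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation: nearest-neighbour resizing stands in for the deconvolution, the final scaling to the input image size is left out, and all function names are hypothetical.

```python
def average_maps(feature_maps):
    """Step 1: average equally sized 2-D feature maps into a single 2-D map."""
    n = len(feature_maps)
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(fm[r][c] for fm in feature_maps) / n for c in range(cols)]
            for r in range(rows)]

def upscale_nearest(mask, out_rows, out_cols):
    """Steps 2 and 4: resize a map (nearest neighbour, NOT real deconvolution)."""
    in_rows, in_cols = len(mask), len(mask[0])
    return [[mask[r * in_rows // out_rows][c * in_cols // out_cols]
             for c in range(out_cols)]
            for r in range(out_rows)]

def multiply(a, b):
    """Steps 3 and 5: pointwise product of two equally sized maps."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def visualization_mask(activations):
    """Combine per-layer activations into one mask.

    `activations` is ordered from the first (largest) convolutional layer to
    the last (smallest); each entry is a list of 2-D feature maps.
    """
    averaged = [average_maps(layer) for layer in activations]
    mask = averaged[-1]                      # start from the topmost layer
    for below in reversed(averaged[:-1]):    # step 6: walk down the layers
        upscaled = upscale_nearest(mask, len(below), len(below[0]))
        mask = multiply(upscaled, below)
    return mask

# Toy example: a lower layer with two 2x2 maps and a top layer with one 1x1 map.
lower = [[[1, 3], [0, 2]], [[3, 1], [2, 0]]]
top = [[[4]]]
print(visualization_mask([lower, top]))  # [[8.0, 8.0], [4.0, 4.0]]
```

Because the masks are multiplied together, a pixel ends up bright only when it is strongly activated at every level, which is what lets the mask pick out whole objects rather than isolated edges.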
The process diagram is shown in Figure 2.
Figure 3 shows the process of creating the visualization mask. The salient objects are shown by highlighting the corresponding pixels in the original image.
Examples are shown in Figure 4. In the top image, the bases of cars and the lane lines are highlighted. The middle image highlights parked cars. In the lower image, the grass is highlighted.
Figure 5 illustrates the PilotNet monitor, which displays the original image alongside the masked one. Figure 6 provides a close-up look at the monitor.
To show that the highlighted objects really do drive the steering control, the authors segment the input image into two classes and conduct a series of experiments. Class 1 includes the parts that have an important effect on the steering angle, and Class 2 includes all pixels in the original image minus the pixels in Class 1. The test for whether the highlighted region has a significant effect works as follows. First, the authors shift Class 1 while keeping Class 2 fixed and use the new image as input to PilotNet; this should produce a significant change in the steering output. Then, they shift Class 2 while keeping Class 1 fixed and use this new image as input; this should produce only a minimal change in PilotNet’s output. Figure 7 illustrates the method. The top image is the original image, and the second image shows the salient regions found when the top image is the input to PilotNet. The third image shows the dilated salient regions, and the bottom image shows the test image in which the dilated salient objects are shifted.
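A rough sketch of how the two kinds of test image could be built is shown below. It assumes the image is a 2-D grid of pixel values and the Class 1 region is given as a boolean mask of the same size. The function names and the choice to fill vacated pixels with a constant are illustrative assumptions, not details taken from the paper.

```python
def shift_row(row, shift, fill):
    """Shift a row right by `shift` pixels (left if negative), padding with `fill`."""
    out = [fill] * len(row)
    for col, value in enumerate(row):
        target = col + shift
        if 0 <= target < len(row):
            out[target] = value
    return out

def shift_salient(image, salient, shift, fill=0):
    """Move the Class 1 (salient) pixels horizontally; keep Class 2 in place."""
    result = []
    for img_row, sal_row in zip(image, salient):
        # Start from the background with the salient pixels blanked out.
        out = [fill if s else p for p, s in zip(img_row, sal_row)]
        # Paste each salient pixel at its shifted position.
        for col, (p, s) in enumerate(zip(img_row, sal_row)):
            target = col + shift
            if s and 0 <= target < len(out):
                out[target] = p
        result.append(out)
    return result

def shift_background(image, salient, shift, fill=0):
    """Move the Class 2 (background) pixels horizontally; keep Class 1 in place."""
    result = []
    for img_row, sal_row in zip(image, salient):
        background = [fill if s else p for p, s in zip(img_row, sal_row)]
        out = shift_row(background, shift, fill)
        # Put the salient pixels back at their original positions.
        for col, (p, s) in enumerate(zip(img_row, sal_row)):
            if s:
                out[col] = p
        result.append(out)
    return result

image = [[1, 2, 9, 4, 5]]
salient = [[False, False, True, False, False]]
print(shift_salient(image, salient, 1))     # [[1, 2, 0, 9, 5]]
print(shift_background(image, salient, 1))  # [[0, 1, 9, 0, 4]]
```

Feeding both variants to the network over a range of shifts and comparing the steering outputs with those for a fully shifted image gives the kind of comparison the experiment relies on.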
Figure 8 plots PilotNet’s steering output as a function of pixel shift in the input image. Shifting only the salient objects produces results similar to shifting the entire image, whereas shifting only the background pixels has a much smaller effect. This supports the authors’ claim that the method correctly identifies the most important regions of an image.
This article presents a method for finding the regions that have a significant impact on steering decisions, as well as a way to validate that method. It also shows that self-driving is not only about using AI to make the computer learn by itself, but also about building ways to test what it has learned.
Author: Shixin Gu | Localized by Synced Global Team: Zhen Gao