Learning to Map Vehicles into Bird’s Eye View

This paper presents a deep learning architecture that maps vehicle detections in the frontal view onto a bird's eye view of the road scene, while preserving the size and direction of the moving vehicles.


1. Introduction

Awareness of the road scene is essential to autonomous driving, and serves as an important component of Advanced Driver Assistance Systems (ADAS). Vision-based algorithms have been integrated into a number of ADAS, and there are three major paradigms, namely: mediated perception approaches (which are based on an understanding of the surroundings of a vehicle), behavior reflex methods (where the driving action is directly controlled by the sensory input), and direct perception techniques (generating a mapping from the input image to a summary of the driving situation in various forms, such as a bird's eye view of the road).

This paper is inspired by direct perception techniques for ADAS, and presents a deep learning architecture that maps vehicle detections in the frontal view (or a dashboard camera view) onto a bird's eye view of the road scene, while preserving the size and direction of the moving vehicles, as illustrated in Figure 1.


2. Dataset

Since it is difficult to obtain a real-world dataset containing both the frontal view and the bird's eye view of the road scene, the authors compiled an annotated synthetic dataset, which they call the GTAV dataset. It contains over 1 million frontal-bird's eye view pairs, collected using the Script Hook V library, which allows access to the functions of the video game Grand Theft Auto V (GTAV). The frontal views and bird's eye views of the road were captured in an alternating manner, with the game camera toggling between the frontal and bird's eye view at every game time step. Samples from the GTAV dataset are shown below in Figure 2.


The authors also made the GTAV dataset, code, and pre-trained model available for download.

Every entry in the GTAV dataset is a tuple of the following attributes:

  1. Frames captured from the frontal and bird's eye view cameras, at a resolution of 1920×1080
  2. The identifiers of the vehicles in the frames and the corresponding vehicle types (for example, a truck)
  3. Coordinates of the bounding boxes in both the frontal and bird's eye view frames; the bounding boxes enclose the vehicles detected in the frames
  4. The distance and orientation of each vehicle with respect to the point where the frontal and bird's eye views are captured
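As an illustration, one dataset entry might be represented as the following Python record; all field names and sample values here are hypothetical, chosen only to mirror the four attributes listed above:

```python
from dataclasses import dataclass
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

@dataclass
class GTAVEntry:
    """One frontal/bird's-eye pair for a single vehicle (illustrative schema)."""
    frontal_frame: str      # path to the 1920x1080 frontal frame
    bird_frame: str         # path to the matching bird's eye frame
    vehicle_id: int         # in-game identifier of the vehicle
    vehicle_type: str       # e.g. "car", "truck"
    frontal_bbox: BBox      # bounding box in the frontal frame
    bird_bbox: BBox         # bounding box in the bird's eye frame
    distance_m: float       # distance from the capture point to the vehicle
    orientation_deg: float  # vehicle orientation w.r.t. the capture point

# A made-up example entry
entry = GTAVEntry("frontal/000001.png", "bird/000001.png", 42, "truck",
                  (850.0, 540.0, 1010.0, 660.0), (900.0, 400.0, 980.0, 520.0),
                  23.5, 12.0)
```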



3. Model

The authors pointed out that the problem of mapping from a frontal view to a bird’s eye view of the road condition could be mistaken for “a geometric view warping between different views”. However, a number of objects existing in the bird’s eye view are invisible in a frontal view, making it challenging to deal with the problem as if it required only a straightforward warping between views. Moreover, in the testing stage, no bird’s eye views would be given, thus the problem cannot be treated as a correspondence problem between views.

Thus, the authors tried to solve the problem from the perspective of deep learning: learning an occupancy map of the road from above the vehicle (a bird’s eye view), given the frontal views of the road captured by dashboard cameras.

The proposed deep architecture named Semantic-aware Dense Projection Network (SDPN) consists of two main branches as demonstrated in Figure 4.


Branch 1 takes image crops of vehicles, detected by the dashboard camera, as input. The semantic features of the input images are extracted using a ResNet50 deep network [1] pre-trained on ImageNet [2], with the final fully-connected layer (originally tailored for the image classification task) removed. The goal of Branch 1 is to identify the vehicles that should appear in the output view, as well as their semantic type (for example, is it a truck or a car?).
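A minimal numpy sketch of Branch 1's role, mapping a 224×224×3 vehicle crop to a semantic feature vector. The 2048-d output width matches ResNet50's pooled features; the random projection standing in for the pre-trained backbone, and the pooling scheme, are simplifications for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random projection standing in for the frozen, pre-trained ResNet50 backbone.
# In the paper this is ResNet50 [1] without its classification head.
W_proj = rng.standard_normal((7 * 7 * 3, 2048)) * 0.01

def branch1_features(crop: np.ndarray) -> np.ndarray:
    """Map a 224x224x3 vehicle crop to a 2048-d semantic feature vector."""
    assert crop.shape == (224, 224, 3)
    # Coarse spatial summary (7x7x3), then project to the 2048-d feature space
    pooled = crop.reshape(7, 32, 7, 32, 3).mean(axis=(1, 3))
    return pooled.reshape(-1) @ W_proj

crop = rng.random((224, 224, 3))   # a dummy vehicle crop
feat = branch1_features(crop)      # shape (2048,)
```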

Branch 2 aims at encoding the coordinates of input bounding boxes (four coordinates for each bounding box) into a 256-dimensional feature space.

The semantic features of the input images captured by the frontal camera (through Branch 1), as well as the encoded coordinates of the bounding box(es) seen in the frontal views (through Branch 2), are then fused via concatenation. A coordinate decoder is responsible for predicting, in the resulting output bird's eye view, the coordinates of the vehicles enclosed by bounding boxes in the frontal views.
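The two branches, the concatenation, and the coordinate decoder can be sketched in numpy as follows. The 256-d coordinate encoding is from the paper; the hidden-layer widths, the tanh output (matching the [-1, 1] coordinate normalization used in training), and the random weights are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

def dense(x, w, b, relu=True):
    y = x @ w + b
    return np.maximum(y, 0.0) if relu else y

# Branch 2: encode the 4 bounding-box coordinates into a 256-d feature space
# (the 256-d width is from the paper; the two-layer shape is an assumption).
W1, b1 = rng.standard_normal((4, 128)) * 0.1, np.zeros(128)
W2, b2 = rng.standard_normal((128, 256)) * 0.1, np.zeros(256)

# Coordinate decoder: fused features -> 4 bird's eye view coordinates in [-1, 1]
# (hidden width 512 is an assumption).
Wd1, bd1 = rng.standard_normal((2048 + 256, 512)) * 0.05, np.zeros(512)
Wd2, bd2 = rng.standard_normal((512, 4)) * 0.05, np.zeros(4)

def sdpn_forward(semantic_feat, frontal_bbox):
    coord_feat = dense(dense(frontal_bbox, W1, b1), W2, b2)    # Branch 2
    fused = np.concatenate([semantic_feat, coord_feat])        # fusion
    hidden = dense(fused, Wd1, bd1)
    return np.tanh(dense(hidden, Wd2, bd2, relu=False))        # decoder

# Dummy 2048-d semantic features and normalized input box coordinates
pred = sdpn_forward(rng.random(2048), np.array([0.44, 0.50, 0.53, 0.61]))
```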

Training details, as found in the training scripts the authors shared on GitHub, are as follows:

  1. The mean pixel value of ImageNet [2] is subtracted from the input crops, which are resized to 224×224 prior to training
  2. The parameters of the ResNet50 network [1] are frozen during training
  3. Ground truth coordinates in the output bird's eye view are normalized to the range [-1, 1]
  4. Dropout is applied after each fully-connected layer (as described in Branch 2 of the proposed SDPN architecture) with a drop probability of 0.25
  5. The model is trained end-to-end using Mean Squared Error (MSE) as the loss function and the Adam optimizer [5] with the following parameters: lr = 0.001, beta_1 = 0.9, beta_2 = 0.999
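A few of these details can be sketched as small helper functions. The ImageNet mean below is the commonly used RGB mean pixel value, and the 1920×1080 frame size comes from the dataset description; the exact helpers are an illustration, not the authors' code:

```python
import numpy as np

# Commonly used ImageNet RGB mean pixel value (an assumption; the authors'
# scripts may use slightly different constants or channel order)
IMAGENET_MEAN = np.array([123.68, 116.779, 103.939])

def preprocess_crop(crop: np.ndarray) -> np.ndarray:
    """Subtract the ImageNet mean from a crop already resized to 224x224x3."""
    return crop - IMAGENET_MEAN

def normalize_coords(bbox_px, width=1920, height=1080):
    """Map pixel coordinates (x1, y1, x2, y2) into [-1, 1], matching the
    ground-truth normalization of the output bird's eye view coordinates."""
    x1, y1, x2, y2 = bbox_px
    return np.array([2 * x1 / width - 1, 2 * y1 / height - 1,
                     2 * x2 / width - 1, 2 * y2 / height - 1])

def mse_loss(pred, target):
    """Mean Squared Error between predicted and ground-truth coordinates."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))
```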

4. Experimental Results

The performance of the proposed SDPN architecture was assessed against three baseline models as follows:

Baseline 1: Homography

The goal is to evaluate the choice of a deep learning perspective, rather than treating the problem as a task of geometrical transformation between views.

The homography approach adopts a projective transformation to compute a mapping between corresponding points in two views, with the points collected from the bottom corners of both the "source" and "target" bounding boxes in the training set. These points are used to estimate a homography matrix in a least-squares manner so as to minimize the projection error. However, this baseline cannot recover the height of a target bounding box, so the average height over the training samples is propagated to the target box.
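A minimal numpy sketch of this baseline: a direct linear transform (DLT) estimate of the homography from point correspondences in a least-squares sense. This is the generic textbook formulation, not the authors' exact implementation:

```python
import numpy as np

def fit_homography(src_pts, dst_pts):
    """Least-squares (DLT) estimate of the 3x3 homography H with dst ~ H @ src.
    src_pts, dst_pts: (N, 2) arrays of corresponding points, N >= 4."""
    A = []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the right singular vector of A with the smallest
    # singular value (minimizes the algebraic projection error).
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def project(H, pts):
    """Apply the homography to (N, 2) points and dehomogenize."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

# Toy correspondences: dst is src scaled by 2 and shifted by (1, 1)
src = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.3, 0.7]], dtype=float)
dst = 2 * src + 1
H = fit_homography(src, dst)
```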

Baseline 2: Grid

The grid baseline approach quantizes the spatial locations in both the input and output views using a regular grid (with a fixed resolution of 108×192) in a probabilistic fashion: for each cell in the input grid (frontal view), the task is to estimate a probability distribution over all cells of the output grid (bird's eye view), so as to find the corresponding location of every pixel in the input grid.
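A toy sketch of this baseline, assuming the probability over output cells is estimated by counting cell correspondences in the training pairs (the counting scheme is an assumption; only the 108×192 grid resolution is from the paper):

```python
from collections import Counter, defaultdict

ROWS, COLS = 108, 192  # fixed grid resolution of the baseline

def cell_index(x, y, width=1920, height=1080):
    """Quantize a pixel location to its grid cell index."""
    r = min(int(y / height * ROWS), ROWS - 1)
    c = min(int(x / width * COLS), COLS - 1)
    return r * COLS + c

# counts[i][j]: how often input cell i corresponds to output cell j
# (sparse, since most cell pairs never co-occur)
counts = defaultdict(Counter)

def observe(frontal_xy, bird_xy):
    """Record one training correspondence between views."""
    counts[cell_index(*frontal_xy)][cell_index(*bird_xy)] += 1

def most_likely_cell(frontal_xy):
    """Return the most probable output cell for an input location, if any."""
    c = counts[cell_index(*frontal_xy)]
    return c.most_common(1)[0][0] if c else None

# Toy training observations
observe((100, 100), (200, 300))
observe((100, 100), (200, 300))
observe((100, 100), (50, 60))
```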

Baseline 3: Multi-Layer Perceptron (MLP)

The purpose of comparing the proposed SDPN architecture to the Baseline 3 approach is to determine the importance of the semantic features used in SDPN, since it might be possible to predict the output coordinates of the bounding boxes given enough training coordinates of input boxes and a sufficiently powerful model.

Therefore, the authors trained an additional model (Baseline 3) with approximately the same number of parameters as the SDPN architecture, fully connected from input to output coordinates. Baseline 3 can be considered a reduced version of the proposed SDPN architecture, without Branch 1, which is in charge of extracting features of the vehicles enclosed by the bounding boxes.
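A numpy sketch of Baseline 3: a fully-connected map from the four input coordinates straight to the four output coordinates, with no visual features. The hidden widths and random weights are illustrative; the paper only states that the parameter count roughly matches SDPN:

```python
import numpy as np

rng = np.random.default_rng(2)

# Layer widths are an assumption chosen only to illustrate the structure
sizes = [4, 512, 512, 4]
weights = [(rng.standard_normal((m, n)) * 0.05, np.zeros(n))
           for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(bbox):
    """Map 4 frontal-view box coordinates to 4 bird's eye view coordinates."""
    x = np.asarray(bbox, dtype=float)
    for i, (w, b) in enumerate(weights):
        x = x @ w + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)   # ReLU on hidden layers
    return np.tanh(x)                # outputs in [-1, 1], like SDPN

out = mlp_forward([0.1, 0.2, 0.3, 0.4])
```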

Three metrics are used to evaluate the performance of the proposed SDPN architecture against the three baseline approaches.


Figure 5 (a) presents the results of the proposed SDPN architecture compared to the three baselines. As pointed out by the authors, homography and grid are "too naive" to map from the frontal view to the bird's eye view. MLP provides reasonable estimates of the bounding box coordinates, but fails to recover the shape of the bounding box in the bird's eye view, since it does not consider the visual features inside the box; this results in unsatisfactory performance in terms of centroid distance (CD).

The proposed SDPN architecture manages to capture the semantics of an object, which serve as cues for estimating both the location and shape of the object (vehicle) enclosed by the bounding box in the target view.

The second experiment evaluates how the distance to the detected vehicle affects the mapping accuracy. As shown in Figure 5 (b), the performance of all models drops as the distance increases, suggesting that it is easier to model a detected vehicle when it is closer, since closer examples have lower variance (for example, they are mostly vehicles ahead or approaching from the opposite direction). Another observation is that the accuracy gap between MLP and SDPN widens as the distance increases, implying that the visual features indeed contribute to the mapping accuracy and robustness of the proposed SDPN architecture.


Figure 6 provides a qualitative comparison between the proposed SDPN architecture and the three baseline models. As seen in Figure 6, the baseline approaches fail to make reasonable estimates of the location of the bounding boxes, while the SDPN architecture excels at predicting the position of the bounding boxes, as well as recovering the orientation and shape of the vehicles in the bird's eye view.


The proposed SDPN architecture is also tested on a real-world driving video dataset [3], taken by a roof-mounted camera. The bounding boxes in the frontal view are generated using a state-of-the-art detector proposed in [4].

Only qualitative results are shown in Figure 7, since ground truth is not available for these videos. The predicted bounding boxes appear reasonable even though the deep network is trained only on synthetic data, suggesting that the SDPN architecture can generalize to real-world data.



5. Conclusion

The paper studies an interesting question: how to map a frontal view to a bird's eye view of the road, while showing all the vehicles detected by the dashboard camera?

The authors contributed a synthetic dataset (the GTAV dataset) of over one million frontal-bird's eye view pairs, and further proposed a deep network architecture (called SDPN), which is demonstrated to be effective in mapping a frontal view to a bird's eye view of the road scene, while preserving the size and shape of the vehicle(s) detected by the frontal camera.


[1] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770-778 (2016)

[2] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 248-255. IEEE (2009)

[3] Alletto, S., Palazzi, A., Solera, F., Calderara, S., Cucchiara, R.: Dr (eye) ve: A dataset for attention-based tasks with applications to autonomous and assisted driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 54-60 (2016)

[4] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: European Conference on Computer Vision. pp. 21-37. Springer (2016)

[5] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014)

Author: Olli Huang | Technical Reviewer: Haojin Yang
