A Brief Review of FlowNet

Convolutional neural networks (CNNs) have made great contributions to various computer vision tasks. Recently, CNNs have also been used successfully to estimate optical flow, and these methods improve substantially on traditional approaches in quality. Here, we give a brief review of the following papers.

image (8).png

image (9).png

image (10).png



Both FlowNet1.0 and FlowNet2.0 are end-to-end architectures. FlowNet2.0 is built by stacking FlowNetCorr and FlowNetS networks, and it achieves much better results than either FlowNetCorr or FlowNetS alone. FlowNetS simply stacks two sequentially adjacent images as its input, while FlowNetCorr convolves the two images separately and then combines them in a correlation layer. In the spatial pyramid network (SPyNet), the authors train one deep network per pyramid level independently to compute the flow update. Both SPyNet and FlowNet2.0 estimate large motions in a coarse-to-fine manner. FlowNet2.0 has the best performance among these architectures, and SPyNet has the fewest model parameters.

FlowNet: Learning Optical Flow with Convolutional Networks

The FlowNet1.0 paper proposes and compares two architectures: FlowNetSimple and FlowNetCorr. Both are end-to-end learning approaches. In FlowNetSimple, as shown in Fig.1, the authors simply stack two sequentially adjacent input images together and feed them through the network. In contrast, FlowNetCorr (Fig.2) first produces representations of the two images separately, then combines them in the ‘correlation layer’ and learns the higher-level representation jointly. Both architectures have a refinement part that upsamples the prediction to a higher resolution.

image (11).png
Fig. 1
image (12).png
Fig. 2

The correlation layer performs multiplicative patch comparisons between two feature maps. More specifically, consider two multi-channel feature maps f1 and f2, with w, h, and c being their width, height and number of channels. The ‘correlation’ of two patches centered at x1 in the first map and x2 in the second map is then defined as:

image (13).png

where x1 and x2 are the patch centers in the first and the second map respectively, and each patch is a square of size K = 2k+1. Furthermore, for computational reasons, the authors limit the maximum displacement. Specifically, for each location x1, they compute correlations only for x2 in a neighborhood of size D = 2d+1, where d is a given maximum displacement. The output therefore has size (w*h*D^2). Afterwards, the authors concatenate this output with a feature map extracted from f1 by a convolution layer.
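The patch comparison described above can be sketched in a few lines of NumPy. This is only an illustrative, unoptimized sketch and not the authors' implementation: the function name `correlation`, the (h, w, c) array layout, the zero padding at the borders and the use of stride 1 everywhere are assumptions made for this example.

```python
import numpy as np

def correlation(f1, f2, k=0, d=1):
    """Correlate patches of f1 with displaced patches of f2.

    f1, f2: feature maps of shape (h, w, c).
    k: patch radius, so the patch size is K = 2k + 1.
    d: maximum displacement, so the neighborhood size is D = 2d + 1.
    Returns an array of shape (h, w, D * D).
    """
    h, w, _ = f1.shape
    D = 2 * d + 1
    pad = k + d
    # Assumption: zero-pad so every patch and displacement stays in bounds.
    p1 = np.pad(f1, ((pad, pad), (pad, pad), (0, 0)))
    p2 = np.pad(f2, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, D * D))
    channel = 0
    for dy in range(-d, d + 1):          # vertical displacement of x2
        for dx in range(-d, d + 1):      # horizontal displacement of x2
            acc = np.zeros((h, w))
            for oy in range(-k, k + 1):  # offsets o inside the patch
                for ox in range(-k, k + 1):
                    a = p1[pad + oy:pad + oy + h, pad + ox:pad + ox + w]
                    b = p2[pad + oy + dy:pad + oy + dy + h,
                           pad + ox + dx:pad + ox + dx + w]
                    # Scalar product of the two feature vectors.
                    acc += (a * b).sum(axis=2)
            out[:, :, channel] = acc
            channel += 1
    return out
```

With d = 1 the output has D^2 = 9 channels per location, and the middle channel (zero displacement) is the correlation of each patch in f1 with the patch at the same position in f2.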

However, after a series of convolution and pooling layers, the resolution has been reduced. Thus, the authors refine the coarse pooled representation with ‘upconvolution’ layers, each consisting of unpooling followed by a convolution. After upconvolving the feature maps, the authors concatenate them with the corresponding feature maps from the contracting part and an upsampled coarse flow prediction, as shown in Fig.3.

image (14).png
Fig. 3

In fact, the model provided by the authors on GitHub differs slightly from the above figure. The second box of Fig.3 consists not only of the feature maps from deconv5 and conv5_1, but also of flow6, which is generated as follows.

conv6 --(conv)--> conv6_1 --(conv)--> predict_flow6 --(conv)--> flow6

Table 1 shows the average endpoint errors (in pixels) of different methods on different datasets.

image (15).png
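For reference, the average endpoint error reported in the table is the Euclidean distance between the predicted and the ground truth flow vector, averaged over all pixels. A minimal NumPy sketch follows; the function name and the (h, w, 2) flow layout are assumptions made for this example.

```python
import numpy as np

def average_endpoint_error(flow_pred, flow_gt):
    """Mean Euclidean distance (in pixels) between two flow fields.

    Both inputs have shape (h, w, 2), holding the (u, v) flow per pixel.
    """
    diff = flow_pred - flow_gt
    per_pixel = np.sqrt((diff ** 2).sum(axis=-1))
    return float(per_pixel.mean())
```

For example, if every predicted vector is off by 3 pixels horizontally and 4 pixels vertically, the average endpoint error is 5 pixels.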

FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks

Intro and Contribution

FlowNet2.0 improves substantially on FlowNet1.0 in both quality and speed. The main architecture is shown in Fig.7. The paper makes four main contributions:

1. It shows that the schedule of presenting data matters for training
2. It proposes a stacked architecture
3. It introduces a sub-network specializing in small motions
4. It proposes a fusion architecture

Dataset Schedules

In the experiments, not only the kind of training data matters for performance, but also the order in which it is presented during training. The authors trained FlowNetS and FlowNetCorr on Chairs, on Things3D, and on an equal mixture of samples from both datasets, using different learning rate schedules. The results are shown in Fig.5. S_short, S_long and S_fine are the different learning rate schedules, as shown in Fig.6. The numbers in Fig.5 are endpoint errors on the Sintel dataset. Fig.5 shows that the best result is obtained by first training on Chairs and then fine-tuning on Things3D.

image (16)
Fig. 5
image (17).png
Fig. 6

Stacking Networks

To handle large displacements, the authors stacked FlowNetS and FlowNetCorr networks, as shown in Fig.7. Table 3 shows the effect of stacking two FlowNetS networks, using the best FlowNetS from Fig.5. The first FlowNetS takes images I1 and I2 as input. The second FlowNetS takes image I1, the flow wi computed by the first FlowNetS, image I2 warped by the flow wi, and the brightness difference error between image I1 and the warped image I2.
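The input of the second network can be illustrated with a small NumPy sketch. It is not the authors' code: the function names, the (h, w, c) layout, the border clamping in the bilinear warp and the use of a per-pixel channel norm as the brightness error are assumptions made for this example, and the channel list simply follows the description above.

```python
import numpy as np

def warp(image, flow):
    """Backward-warp image (h, w, c) by flow (h, w, 2) with bilinear sampling.

    The output at pixel (x, y) is the image sampled at (x + u, y + v).
    Assumption: sample positions are clamped to the image border.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    ax = (x - x0)[..., None]  # horizontal interpolation weight
    ay = (y - y0)[..., None]  # vertical interpolation weight
    top = (1 - ax) * image[y0, x0] + ax * image[y0, x1]
    bottom = (1 - ax) * image[y1, x0] + ax * image[y1, x1]
    return (1 - ay) * top + ay * bottom

def second_network_input(i1, i2, flow):
    """Stack I1, the flow, the warped I2 and the brightness error."""
    warped = warp(i2, flow)
    # Assumption: brightness error as the per-pixel norm over color channels.
    error = np.linalg.norm(i1 - warped, axis=-1, keepdims=True)
    return np.concatenate([i1, flow, warped, error], axis=-1)
```

For RGB images this gives 3 + 2 + 3 + 1 = 9 input channels. If the first flow estimate is already good, the warped I2 is close to I1, so the second network only has to estimate a small remaining correction.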

There are two ways to train the stacked architecture: fixing the weights of the first network, or updating them together with the second network. The results are shown in Table 3, from which we see that the best results on Sintel are obtained when fixing Net1 and training Net2 with warping.

The authors also experimented with stacking multiple diverse networks. They found that stacking a network with the same weights multiple times, and also fine-tuning this recurrent part, does not improve the results. They therefore add networks with different weights, where each new network is first trained on Chairs and then fine-tuned on Things3D. Finally, they constructed FlowNet2-CSS by balancing network accuracy against run time. FlowNet2-CSS consists of a FlowNetCorr followed by two FlowNetS networks, as shown in the first stream of Fig.7.

image (18).png

Small Displacement Network and Fusion

However, FlowNet2-CSS is not reliable for small displacements. Thus, the authors created a small dataset with small displacements and trained FlowNet2-SD on it. FlowNet2-SD differs slightly from FlowNetS: the authors replaced the 7*7 and 5*5 kernels at the beginning with multiple 3*3 kernels, and removed the stride 2 in the first layer. Finally, the authors introduced a small and simple network (Fusion) to fuse the outputs of FlowNet2-CSS and FlowNet2-SD, as shown in Fig.7.

image (19).png
Fig. 7


Table 4 shows the performance on different benchmarks. AEE is the average endpoint error; Fl-all is the ratio of pixels where the flow estimate is wrong by both at least 3 pixels and at least 5%. On Sintel, Sintel final and Middlebury, FlowNet2 surpasses all the other reference methods in accuracy, and it also achieves relatively high accuracy on the other datasets.

image (20).png
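The Fl-all measure can be computed per pixel from the endpoint error and the ground truth magnitude. The sketch below is only an illustration of the definition given above; the function name, the parameter names and the choice to take the 5% relative to the ground truth flow magnitude are assumptions, and a real benchmark would additionally restrict the average to pixels with valid ground truth.

```python
import numpy as np

def fl_all(flow_pred, flow_gt, px_thresh=3.0, rel_thresh=0.05):
    """Fraction of pixels whose flow is wrong by both thresholds.

    A pixel counts as an outlier if its endpoint error is at least
    px_thresh pixels AND at least rel_thresh times the magnitude of
    the ground truth flow (assumed reference for the percentage).
    """
    epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    magnitude = np.linalg.norm(flow_gt, axis=-1)
    outlier = (epe >= px_thresh) & (epe >= rel_thresh * magnitude)
    return float(outlier.mean())
```

Requiring both conditions means that a 4 pixel error on a 100 pixel motion is not counted as an outlier, because it is below 5% of the true displacement.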

Optical Flow Estimation Using a Spatial Pyramid Network


This paper proposes a new optical flow method that combines a classic spatial-pyramid formulation with deep learning. It is a coarse-to-fine approach: at each level of the spatial pyramid, the authors train a deep neural network to estimate a flow update, instead of training one single deep network. This helps with large motions, because each network has less work to do and the motion seen at each level becomes smaller. Compared to FlowNet, SPyNet is much simpler and 96% smaller in terms of model parameters. On some standard benchmarks, SPyNet is also more accurate than FlowNet1.0.



image (21).png
Fig. 8

A 3-level pyramid network is shown in Fig.8:

  • d() is the downsampling function that reduces an m*n image I to m/2*n/2
  • u() is the upsampling function that resamples the optical flow field
  • w(I,V) warps image I according to the optical flow field V
  • {G_0,…,G_K} is a set of trained convolutional neural networks
  • v_k is the residual flow computed by convnet G_k at the k-th pyramid level

image (22).png

At the k-th pyramid level, the residual flow v_k is computed by G_k from I_k1, the upsampled flow from the previous pyramid level, and I_k2 warped by that upsampled flow. The flow V_k can then be represented by

image (23).png

image (24).png
Convnets {G_0,…,G_K} are trained independently to compute the residual flow v_k. The ground truth residual flow v^_k is obtained by subtracting u(V_k-1) from the downsampled ground truth flow V^_k. The authors train the networks by minimizing the average End Point Error (EPE) loss on the residual flow v_k, as shown in Fig.6.

image (25).png
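The coarse-to-fine inference and the residual training target described above can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' implementation: the networks G_k are passed in as plain callables, d() is 2*2 average pooling, u() is nearest-neighbor upsampling that also doubles the flow vectors, the initial flow at the coarsest level is zero, and the image size must be divisible by 2^K.

```python
import numpy as np

def downsample(x):
    """d(): halve the resolution by 2*2 average pooling (assumption)."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

def upsample_flow(v):
    """u(): double the resolution and scale the vectors by 2 (assumption)."""
    return 2.0 * v.repeat(2, axis=0).repeat(2, axis=1)

def warp(image, flow):
    """w(I, V): bilinear backward warp, clamped at the border (assumption)."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    ax, ay = (x - x0)[..., None], (y - y0)[..., None]
    top = (1 - ax) * image[y0, x0] + ax * image[y0, x1]
    bottom = (1 - ax) * image[y1, x0] + ax * image[y1, x1]
    return (1 - ay) * top + ay * bottom

def spynet_inference(i1, i2, nets):
    """Run nets = [G_0, ..., G_K] from the coarsest to the finest level.

    Each G_k is a callable (I_k1, warped I_k2, upsampled flow) -> v_k.
    """
    pyr1, pyr2 = [i1], [i2]
    for _ in range(len(nets) - 1):
        pyr1.append(downsample(pyr1[-1]))
        pyr2.append(downsample(pyr2[-1]))
    pyr1.reverse()  # index 0 is now the coarsest level
    pyr2.reverse()
    flow = None
    for k, net in enumerate(nets):
        if flow is None:
            up = np.zeros(pyr1[k].shape[:2] + (2,))  # zero initial flow
        else:
            up = upsample_flow(flow)
        v_k = net(pyr1[k], warp(pyr2[k], up), up)  # residual flow
        flow = up + v_k                            # V_k = u(V_k-1) + v_k
    return flow

def residual_target(gt_flow_k, prev_flow):
    """Training target at level k: downsampled ground truth minus u(V_k-1)."""
    return gt_flow_k - upsample_flow(prev_flow)
```

Because the second image is warped by the upsampled flow before it reaches G_k, each network only has to predict a small residual, which is why the individual networks can stay small.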


image (26).png

Personal Perspectives

FlowNet2.0 is more accurate than FlowNet1.0 mainly because the stacked structure and the fusion network make the model much larger. The stacked structure estimates large motion in a coarse-to-fine manner: at each level it warps the second image with the intermediate optical flow and computes a flow update. This reduces the difficulty of the learning task at each level and helps with large displacements. As for the fusion network, the authors introduce FlowNet2-CSS and FlowNet2-SD to estimate large and small displacements, respectively. The fusion network then fuses the two kinds of optical flow produced by these networks, with the aim of improving the overall quality of the final predicted optical flow.

From my perspective, reusing feature maps contributes to the good performance of FlowNet1.0 and FlowNet2.0. In the refinement part, the upsampled flow estimate and the upconvolved feature maps are concatenated with the feature maps of the same resolution from an earlier layer, and the result is used as the input of the next deconvolution layer. The earlier feature maps are thereby reused, which is somewhat similar to DenseNet in design concept.

SPyNet is a stacked network as well. Like FlowNet2.0, it is suited to handling large displacements through a coarse-to-fine approach. The difference is that SPyNet is much smaller than FlowNet2.0, and each level of SPyNet is a deep network trained independently. SPyNet has far fewer model parameters than FlowNet because it applies the warping function directly, so the convnet does not need to learn it.

In general, FlowNet2.0 has the best performance, while SPyNet is much more lightweight: it has fewer parameters, runs faster, and can be used on mobile devices.


FlowNet: Learning Optical Flow with Convolutional Networks link:
FlowNet2.0 Paper link:
Optical Flow Estimation using a Spatial Pyramid Network link:

Author: Ziyun Li | Editor: Haojin Yang | Localized by Synced Global Team: Xiang Chen

