AI Machine Learning & Data Science Research

How to speedup 31*31 conv 10 times

MegEngine Teams propose large-kernel convolution optimization strategy, speeding up 31*31 convolutional neural networks 10 times.

Why using convolution with such a large kernel size?

Vision Transformer (ViT), since proposed in 2021, has become a promising research topic in the realm of deep learning in computer vision. As the study of ViT goes deeper, some researchers are enlightened by the large receptive field of ViT. If it is the large receptive field that leads to the effectiveness of ViT, the same concept could be borrowed onto the CNN structure, by simply making the convolutional network with a larger receptive field. According to the effective receptive field (ERF) theory, the ERF size is proportional to the kernel size and the square root of the model depth. This means that achieving a large ERF by stacking layers is not as effective as increasing the convolutional kernel size. Therefore, researchers propose a new CNN structure (paper link) featuring large convolutional kernels. This network achieves comparable or superior accuracy against ViT and its variations on ImageNet and on some object-detection, segmentation challenges. “Large-kernel CNNs can outperform the small-kernel ones.” This could potentially bring a sea change to the evolution path of the CNNs.

But large kernel brings significate more computation and parameters

The most common convolutional kernel sizes in CNNs are 3×3, 5×5, 7×7. In this article, we identify the convolutional kernels as “large” ones if their sizes exceed 9×9. As derived in math, the size of a convolution kernel increases n times, the number of parameters and the floating-point operations (FLOPs) will approximately increase n square times. This is the main reason why researchers turned away from large kernel convolutions in the first place when exploring the new structures of CNNs. As illustrated in the figure below, convolving channel-by-channel (a.k.a. depthwise convolution) can reduce the FLOPs and the number of parameters to 1/(number of input channels) of the ones of dense convolution. That’s why most researchers design the large-kernel convolutions as depthwise convolutions to obtain the benefits of the large-kernel convolutions without having an enormous increase of the parameters and FLOPs.

But sadly, under the same FLOPs limit, researchers found their neural network with the large,depthwise convlution, runs significantly slower than the old-school one.

However, We, MegEngine Teams, found that with proper optimization, depthwise convolutions cost almost nothing when kernel size become larger and larger. Let’s see why.

How to find the optimization room for the large-kernel convolution?

Roofline Model

To answer this question, we need to introduce the Roofline model as the background. As shown in the figure below, Roofline model tries to estimate how fast an application can run on a given computing device, by showing inherent hardware limitations, the potential benefit, and the priority of optimizations.

图 2
  • Theoretical Peak TP : the upper bound of performance of a computing device. Specifically, it refers to the maximum number of floating-point operations per second that a computing device can perform, measured in the unit of FLOPS.
  • Maximum Bandwidth B: the upper bound of the memory bandwidth of a computing device. Specifically, it refers to the maximum number of memory swaps per second that a computing device can complete, measured in the unit of Byte/s.
  • Maximum Computational Density (or Operational Intensity) IM: the maximum number of operations per memory swap that a computing device can perform, measured in the unit of FLOPS/Byte (notice here is FLOPs not FLOPS, meaning number of Float Operation).

“Roofline” refers to the shape of the plot of TP against IM. The theoretical peak of the device determines the height of the “roof” (blue line). Meanwhile, the slope of the “roof” (red line) is the maximum access bandwidth of the device. The joint of the two colored lines is also the separation of the two types of bottlenecks of application, Compute Bound and Memory Bound.

When the application’s computational density I exceeds the maximum IM, its performance can only reach no higher than the theoretical peak TP of the computing device. Then, the maximum performance P of the application is apparently not proportional to the computational density I. This type of application is called Compute Bound. When the application’s computational density is less than IM, the performance is determined by both the maximum bandwidth of the device and the computational density. This type of application is called Memory Bound. In contrast to the Compute Bound scenario, increasing the device bandwidth or increasing the computational density of the Memory Bound applications can lead to a linear increase in the application’s performance.

Dense convolution: Just can’t improve anymore

Nowadays, we have explored a lot of optimization techniques for dense convolution, including Direct, im2col/implicit GEMM, Winograd and FFT.

let’s analyze it in the context of the Roofline model. In this article, we choose a NVIDIA 2080Ti GPU as the computing device, which has a measured L2 cache bandwidth of 2.16TB/s and a theoretical peak performance of 4352 FFMA Cores * 1.545 GHZ * 2 = 13.447 TFLOPS. We assume that the output data for each thread in CUDA is accumulated in registers and that the L1 cache hits 100%. We also ignore the process of writing back to the output. Since the design of modern computing devices is capable enough to support many time-consuming access operations concurrently in the actual convolutional computation, we also assume that the L2 cache hits 100% and that the maximum bandwidth of the L2 cache is reached. The input shape of the convolution used in this article is (n, ic, ih, iw). The kernel is (oc, ic, kh, kw), and the output is (n, oc, oh, ow).

im2col/implicit GEMM are most classical optimization methods for dense convolution, you may checkout our previous artical about implement convolution with TensorCore in MegEngine. After the im2col transformation, we converted the convolution into a matrix multiplication problem, where 𝑀=oc𝑁=𝑛×oh×ow𝐾=ic×kh×kw, as shown in the following figure.

图 3.png

Matrix multiplication has been optimized really well in computing libraries such as cuBLAS. The performance is close to the theoretical peak of the device especially when matrix is large enough. We hereby give a brief analysis on the performance using the Roofline model.

In order to take advantage of the hardware architecture, we normally divide the calculation of the matrix multiplication into blocks, so that multi-level storage can work at full capability to get the maximum memory access bandwidth.

As shown in the figure below, given each Thread Block in CUDA processes the output of BM×BN

  • kernel block size is BM×BK
  • input block size is BK×BN
  • So, The number of computation is BM×BN×BK×2
  • The memory access is (BM×BK+BN×BK)×4
  • The computation density is BM×BN×2/(BM+BN)×4

According to the description of Roofline model, the IM of the device is calculated as IM=TP/𝐵=13.447/2.16=6.225 FLOPs/byte. To reach the theoretical peak of the device, we just need to ensure that the computational density is greater than IM. Suppose BM=32BN=32, then the computational density will reach 8 FLOPs/byte, which is obviously greater than IM. Apparently, this application falls into the Compute Bound region, which means there is no room for performance optimization. We can definitly get the old conclusion from VGG era: “as the kernel size grows, time usage increases squarely”.

图 4.png

Depthwise conv: Not a same case

The number of computation grows squarely as kernel size grows, no matter dense or depthwise. Since the performance trend of cuDNN on depthwise conv does show similar conclusions, many people think that the previous conclusions on dense conv are also valid on depthwise conv.

That’s FALSE GENERALIZATION of a conclusion, depthwise conv does not share same precondition.

First, let’s analyze depthwise convolution by im2col/implicit GEMM method. Since depthwise conv calculates channel-by-channel, it can be viewed as a group of single-channel convolutions, channel number equals to the group size. After the im2col transformation, we will get a Batched GEMV, for each batch as shown in the figure below.

图 5.png

Let’s keep same chunk size with the dense one, the GEMV of each batch is shown in the figure below. The computational density at this point is BN×2/(1+𝐵𝑁)×4=BN/(2×BN+2). Given BN=1, the maximum computational density is 0.25 FLOPs/byte, which is far less than the IM 6.225. It means we are in memory-bound zone now. Although there’re some ways to make GEMV faster, the layout of “vector x matrix” is destined to be memory-bound application.

图 6.png

Next, we will analyze Direct method. For example, we make all 32 threads of a warp compute the output of oh*ow, with a kernel size of kh*kw:

  • Number of floating-point operations = oh×ow×khkw×2 FLOPs
  • Number of memory access = (kh×kw+(oh+kh−1)×(ow+kw−1))×4 bytes
    • kernel size: kh×kw
    • input size: (oh+kh−1)×(ow+kw−1)
  • Computational density = (oh×ow×kh×kw×2)/{(kh×kw+(oh+kh−1)×(ow+kw−1))×4}

Let’s walk a little further on this example. Let’s each thread compute 4 outputs, then one warp can compute 4×32 outputs when the kernel size is 3×3. Accordingly, the computational density is (4×32×3×3×2)/(3×3+6×34)×4=2.7 FLOPs/byte. The maximum performance is 2.16*2.7 = 5.84 TFLOPS, which is far from the theoretical peak of 13.447 TFLOPS. An idea is that we could increase the number of output computed by each warp, to increase the computational density. However, the workload of each warp cannot be increased indefinitely due to the output size of the convolution itself and the limited computational resources such as the register file in each stream multi-processor.

To wrap up our findings in both im2col and direct method: depthwise convolution, unlike dense convolution, is basically a Memory Bound operation. Increasing kernel size doesn’t change the amount of memory access a lot, the time of computing should remain the same until it reach compute bound state. Under proper optimization, it should be almost free to increase kernel size of depthwise conv.

The Real Performance in MegEngine

Different from theoretical analysis, we found that the performance of cuDNN is really bad as the kernel size increasing in our profiling.

The setting for this profiling

  • input shape: (64, 384, 32, 32)
  • output shape: (64, 384, 32, 32)
  • device: 2080Ti

That’s why MegEngine do intensive optimization on large-kernel depthwise conv, the computation time remain the same during the kernel size increasing as predicted by theory. Comparing to PyTorch, the training time usage is only 10% when using MegEngine.

Install MegEngine and try it right now!

Installation path and more information: https://github.com/MegEngine/MegEngine welcome star 😀

Contributor: MegEngine team


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.

0 comments on “How to speedup 31*31 conv 10 times

Leave a Reply

Your email address will not be published. Required fields are marked *

%d bloggers like this: