In recent years, machine learning researchers have achieved remarkable results by increasing the number of parameters of their models. Excited as we are about these meaningful steps forward, many of us are increasingly concerned about the extravagant growth of “the number of parameters”. As we all know, the parameter count of SOTA models has grown by several hundred times, even a thousandfold, since ResNet came out in 2015, while the memory size of GPUs has barely increased. It means that to train large models nowadays, one has little choice but to use an enormous number of GPU cards. As a consequence, budget has become a far more critical constraint on machine learning and other interdisciplinary research than it ever was; research projects now depend on their financial resources more than they ever did. Developers of the machine learning framework MegEngine approach the problem of large model training not only through workload distribution across multiple GPUs, but also from a GPU-economical perspective, working on memory optimization techniques to get the most out of a single GPU.
As the dynamic computational graph is now widely supported by machine learning frameworks, GPU memory utilization when training on a dynamic computational graph has become a key specification of these frameworks. In the recently released v1.4, MegEngine provides a way to reduce GPU memory usage by trading additional computation for it, using the Dynamic Tensor Rematerialization (DTR) [1] technique together with further engineering optimizations, which makes large-batch training on a single GPU possible. On a 2080Ti, the maximum batch size of networks such as ResNet-50 and ShuffleNet can reach three times what it is without DTR.
The article below is written by MegEngine developers and gives a thorough introduction to how MegEngine optimizes GPU memory usage in a dynamic computational graph with DTR, mainly from an engineering perspective. You are welcome to get MegEngine from its GitHub page and give it a try.
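As a quick illustration of the “two lines of code” in the title, a training script only needs to enable DTR before the training loop starts. The snippet below is a minimal sketch; the module path and attribute names (megengine.dtr, eviction_threshold, enable) are assumptions based on recent MegEngine documentation and may differ between releases, so please consult the official docs for the exact API of your version.

```python
# Minimal sketch of enabling DTR in a training script.
# NOTE: the names below are assumptions and may differ between MegEngine
# releases; check the official documentation for the exact API.
import megengine as mge

mge.dtr.eviction_threshold = "5GB"  # start evicting tensors once usage passes this threshold
mge.dtr.enable()                    # the rest of the training script stays unchanged
```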
1. Background
1.1 Computational graph
In machine learning, a neural network model can essentially be represented by a computational graph. Its training process involves three parts: forward propagation, backpropagation, and parameter update. Take y = wx + b as an example; its forward computational graph is shown in the following figure:

To get the output y, we first multiply the input x and the parameter w to get the intermediate result p. Then we add p and the parameter b together.

Backpropagation computes the derivatives of y with respect to w and b. First, it finds the derivative of y with respect to p (which is 1) and the derivative of p with respect to w (which is x). According to the chain rule, the derivative of y with respect to w is 1*x, which is x. Note that, in general, intermediate results of forward propagation such as p need to be kept for backpropagation (for instance, if the second operator were a multiplication or a nonlinearity rather than an addition, the value of p would be required to compute its gradient). A common practice of many machine learning frameworks is to keep all these intermediate results in GPU memory simply for convenience. This is not a problem until the network being trained becomes sufficiently large. In that case, storing the intermediate results takes up a considerable amount of GPU memory, which in turn brings down the maximum batch size allowed in training.
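To make the three steps concrete, here is a minimal numpy sketch of the same y = wx + b example, tracing the forward pass, the chain-rule backward pass, and a plain SGD update. It is illustrative only and not MegEngine code.

```python
# Forward, backward, and parameter update for y = w*x + b, written out by hand.
import numpy as np

x = np.array(2.0)                    # input
w, b = np.array(3.0), np.array(1.0)  # parameters

# forward propagation: the intermediate result p is produced along the way
p = w * x        # p = 6.0
y = p + b        # y = 7.0

# backpropagation: apply the chain rule from y back to the parameters
dy_dp = 1.0              # derivative of the addition with respect to p
dy_dw = dy_dp * x        # = x = 2.0
dy_db = dy_dp * 1.0      # = 1.0

# parameter update (plain SGD with learning rate 0.1)
lr = 0.1
w -= lr * dy_dw
b -= lr * dy_db
print(w, b)      # 2.8 0.9
```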
1.2 Memory optimization in static computational graph
Before the dynamic computational graph emerged, the static computational graph was the commonly studied, state-of-the-art representation of machine learning models. GPU memory optimization techniques based on it fall into three categories:
1. Static memory allocation. Since the entire computational graph is available at compile time, the life cycle of each tensor within the graph can be analyzed. Tensors whose life cycles do not overlap can share the same space in GPU memory.
2. Gradient checkpointing (trading additional computation for GPU memory space). Only a few gradient checkpoints are kept throughout the graph to store intermediate results; the other intermediate results are released. During backpropagation, if an intermediate result is needed but not found in GPU memory, it is recomputed from the nearest gradient checkpoint (a sketch of this idea follows the list).

3. Memory swap (trading bandwidth for GPU memory space). The temporarily unused data is swapped from the GPU to the CPU, and swapped back when it is needed.
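As a concrete toy illustration of gradient checkpointing, the following self-contained Python sketch keeps only every k-th activation during the forward pass over a chain of operators and recomputes any missing activation from the nearest stored checkpoint when it is needed. It models the idea only, not MegEngine's actual implementation.

```python
# Toy illustration of gradient checkpointing on a chain of element-wise ops.
# Only every k-th activation is kept; the others are recomputed on demand
# from the nearest stored checkpoint.
import math

ops = [math.sin, math.exp, math.tanh, math.cos, math.sqrt, abs]  # a toy "network"

def forward_with_checkpoints(x, k=3):
    """Run the chain, storing only activations at indices 0, k, 2k, ..."""
    checkpoints = {0: x}
    for i, op in enumerate(ops):
        x = op(x)
        if (i + 1) % k == 0:
            checkpoints[i + 1] = x
    return x, checkpoints

def activation(i, checkpoints):
    """Return the input of op i, recomputing from the nearest earlier checkpoint."""
    j = max(idx for idx in checkpoints if idx <= i)
    x = checkpoints[j]
    for op in ops[j:i]:
        x = op(x)        # the extra computation traded for memory
    return x

y, ckpts = forward_with_checkpoints(1.0)
# During backpropagation, the (unsaved) input of op 4 can still be obtained:
print(activation(4, ckpts))
```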

2. Memory optimization in dynamic computational graph and DTR strategy
2.1 Memory optimization in dynamic computational graph
The most obvious change from static to dynamic computational graph is that the global information of the graph cannot be obtained in advance, which means the life cycle of each tensor can no longer be determined ahead of time. Static GPU memory allocation for the entire computational graph is therefore no longer feasible. Gradient checkpointing is still possible, except that the locations of the checkpoints are only optimal with respect to the part of the graph seen so far. Memory swapping, on the other hand, is certainly still available in the dynamic graph.
To sum up, two types of GPU memory optimization strategies remain for the dynamic computational graph:
(1) trading computation for GPU memory space, also known as sublinear GPU memory optimization;
(2) trading bandwidth for GPU memory space.
We carried out an experiment to find out which of the two strategies is more worthwhile.

In the figure above, we analyze three intermediate tensors in ResNet-50, which are the outputs of three operators: Convolution, BatchNorm, and ReLU. For each tensor, we list the memory space it occupies and compare the time cost of recomputing it with the time cost of swapping it. Interestingly, we find that swapping takes roughly two orders of magnitude longer than recomputing for each tensor. This is because data exchange between the CPU and GPU is bound by the low bandwidth of PCIe (for example, when eight 2080Ti graphics cards are training at the same time, the observed bandwidth of each card is only about 3 GB/s). In comparison, the time to recompute these tensors is negligible thanks to the highly parallel computing hardware. Therefore, trading computation for GPU memory space is the better strategy.
To proceed further, we implement this strategy in three phases:
- infrastructure: we implement the data structure to record the computational path of each tensor encountered during the computation.
- all-by-user plan: we provide the users with the interfaces to release and restore tensors. Then it is the users who need to devise their own policies on when to release a tensor and when to recalculate in order to restore it.
- all-by-framework plan: Our framework automatically finds a policy and executes it in the background, without any user intervention.
There are only two ways a tensor can be created during network training: (1) loaded from external data, such as input data; (2) calculated as the output of a certain operator, such as the result of a convolution. In the latter case, we can record the computational path of the tensor in the structure shown below (a Python sketch of this record follows the list):

- A producer is attached to each tensor. If the producer is empty, the tensor is loaded from external data. Otherwise, a computational path exists, where:
- op represents the operator that generates this tensor;
- inputs represent the input tensors required by this operator;
- outputs represent the output tensors produced by this operator;
- compute_time represents the actual running time of this operator;
- users stores all computational paths that use this tensor as input;
- ref_cnt represents how many times this tensor is referred to.
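The record described above can be sketched in Python as follows; the field names mirror the list, while the real MegEngine structures live in C++ and differ in detail.

```python
# A Python sketch of the bookkeeping described above; illustrative only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ComputePath:
    op: str                          # the operator that generates the output tensors
    inputs: List["TensorInfo"]       # input tensors required by this operator
    outputs: List["TensorInfo"]      # output tensors produced by this operator
    compute_time: float              # actual running time of this operator

@dataclass
class TensorInfo:
    producer: Optional[ComputePath] = None                   # None => loaded from external data
    users: List[ComputePath] = field(default_factory=list)   # paths using this tensor as input
    ref_cnt: int = 0                                          # how many times the tensor is referred to

# Example: record that c was produced by an "add" applied to a and b.
a, b, c = TensorInfo(), TensorInfo(), TensorInfo()
path = ComputePath(op="add", inputs=[a, b], outputs=[c], compute_time=0.01)
c.producer = path
for t in (a, b):
    t.users.append(path)
    t.ref_cnt += 1
```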
Here is an illustration of how to use the calculation history to release and restore a tensor:

Let’s walk through a series of tensor calculations in MegEngine. First, we calculate c = a + b. Assume that the GPU memory is just large enough to hold three tensors; then, as shown in the yellow box, tensors a, b, and c can exist in GPU memory simultaneously. Note that the computational path of c is recorded in host memory at this step. The next calculation is d = a * b. Since there is no space left in GPU memory for d, the framework needs to free tensor c and let d take up c’s space (as shown in the following green box). If the next instruction is print(c), we face two issues:
- Tensor c is no longer in the GPU memory: the framework needs to restore it through recalculation along the computational path recorded for c.
- There is no space left in the GPU memory for c: the framework needs to free one of the currently unused tensors (for example, tensor d).
The gray box depicts how these two issues are resolved. In the end, we call print(d). What the framework needs to do is exactly the same as when it executed d = a * b, and is shown in the last green box.
With this example, we show that users never need to know whether a tensor is actually resident in GPU memory. Whenever a released tensor is accessed, the framework restores it in GPU memory, as if it had always been there.
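Written down as the user sees it, the walk-through is just four lines of ordinary code. The comments sketch what the framework does behind each line, under the assumption that GPU memory holds at most three tensors (plain Python numbers stand in for tensors).

```python
a, b = 1.0, 2.0   # already resident in GPU memory
c = a + b    # a, b, c all fit; the path "c = add(a, b)" is recorded in host memory
d = a * b    # no room left for d -> the framework frees c and gives d its space
print(c)     # c is not resident -> free d, recompute c = a + b from the recorded path
print(d)     # d is not resident -> free c, recompute d = a * b, exactly as before
```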
2.2 DTR strategy
To formulate an efficient strategy for releasing and restoring tensors, we adopt the technique described in the paper Dynamic Tensor Rematerialization. Our strategy with DTR is dynamic and heuristic: when GPU memory usage reaches a threshold, it selects some tensors and releases them. Whether a tensor is a good candidate for release depends on three aspects: (1) the cost of recomputing it, (2) the GPU memory space it occupies, and (3) how long it has stayed in GPU memory.

In addition, the DTR paper considers another aspect: the overhead of searching for the best candidate to release. Since the search happens on the fly, it must be cheap enough not to block program execution.
Two approximate optimizations are proposed in the paper to reduce the search space (both appear in the sketch below):
(1) Small tensors are ignored: when the size of a tensor is less than 1% of the average tensor size in the candidate set, it is not added to the candidate set.
(2) Only about sqrt(N) tensors (where N is the size of the candidate set) are examined when a tensor is to be released.
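Putting the three aspects and the two approximations together, the candidate selection can be sketched as below. The score (recomputation cost divided by the product of memory size and staleness, with the smallest score evicted first) follows the DTR paper, while the surrounding bookkeeping is simplified for illustration.

```python
# Sketch of the DTR eviction heuristic plus the two search-space reductions.
# Each candidate carries its size in bytes, measured recomputation cost, and
# the timestamp of its last access (its "staleness").
import math, random
from collections import namedtuple

Evictable = namedtuple("Evictable", "name size compute_time last_access")

def eviction_score(t, now):
    """Smaller is better to evict: cheap to recompute, large, and long unused."""
    staleness = now - t.last_access
    return t.compute_time / (t.size * staleness)

def pick_tensor_to_evict(candidates, now):
    # (1) ignore tensors smaller than 1% of the average size in the candidate set
    avg = sum(t.size for t in candidates) / len(candidates)
    pool = [t for t in candidates if t.size >= 0.01 * avg]
    # (2) examine only about sqrt(N) randomly sampled candidates
    k = max(1, int(math.sqrt(len(pool))))
    sample = random.sample(pool, min(k, len(pool)))
    return min(sample, key=lambda t: eviction_score(t, now))

cands = [Evictable("conv_out", 25 << 20, 2e-3, 10.0),
         Evictable("bn_out",   25 << 20, 5e-4, 11.0),
         Evictable("relu_out", 25 << 20, 1e-4, 12.0)]
print(pick_tensor_to_evict(cands, now=20.0).name)
```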
3. The Implementation in MegEngine
3.1 The core of dynamic computational graph — Tensor Interpreter
Before introducing the implementation of DTR, let us take a closer look at the core of MegEngine's dynamic computational graph: the Tensor Interpreter. It manages tensors in GPU memory through the following four operations (a schematic dispatch loop is sketched after the list):
- Put: load the tensor from the host into GPU memory
- ApplyOp: execute an operator and get its output tensor
- Del: delete a tensor and release the space it occupies in GPU memory
- GetValue: transfer the tensor from GPU to the host memory
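A schematic dispatch loop over these four commands might look like the following. Dictionaries stand in for GPU and host memory, and the real interpreter (implemented in C++) is of course far more involved.

```python
# Schematic dispatch loop for the four interpreter commands described above.
def interpret(commands, gpu, host):
    for cmd, arg in commands:
        if cmd == "Put":          # host -> GPU
            gpu[arg] = host[arg]
        elif cmd == "ApplyOp":    # run an operator on GPU-resident inputs
            op, inputs, output = arg
            gpu[output] = op(*(gpu[i] for i in inputs))
        elif cmd == "Del":        # release the GPU memory held by a tensor
            gpu.pop(arg, None)
        elif cmd == "GetValue":   # GPU -> host
            host[arg] = gpu[arg]

host = {"a": 2.0, "b": 3.0}
interpret([("Put", "a"), ("Put", "b"),
           ("ApplyOp", (lambda x, y: x + y, ["a", "b"], "c")),
           ("GetValue", "c"), ("Del", "a")], gpu={}, host=host)
print(host["c"])   # 5.0
```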
3.2 The underlying (C++) implementation of tensor releasing and restoring

As shown in the figure above, each tensor in Python has a one-to-one correspondence with a tensorInfo in C++. If a tensor is selected to be released, MegEngine resets the tensor's pointer to release its GPU memory. In order to restore the tensor in the future, some information needs to be recorded in its tensorInfo: if you use drop (trading computation for GPU memory space), the computational path is recorded; if you use swap (trading bandwidth for GPU memory space), the data is transferred to host memory and recorded as a host tensor in the tensorInfo. When the user accesses the tensor, MegEngine checks the corresponding tensorInfo. If the tensorInfo shows that the tensor is not resident in GPU memory, MegEngine restores its content from either the computational path or the host tensor recorded in the tensorInfo.
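The two restore paths can be summarized in a short Python sketch. drop and swap mirror the two options in the text, with plain Python objects standing in for GPU buffers and the C++ tensorInfo.

```python
# Sketch of releasing and restoring a tensor either by recomputation ("drop")
# or by moving it to host memory ("swap"). Illustrative only.
class TensorInfoSketch:
    def __init__(self, data, producer=None):
        self.data = data            # stands in for the GPU buffer (None = released)
        self.producer = producer    # (fn, inputs) computational path, if any
        self.host_copy = None       # filled in when the tensor is swapped out

    def drop(self):                 # trade computation for GPU memory
        assert self.producer is not None, "cannot drop a tensor loaded from external data"
        self.data = None

    def swap_out(self):             # trade bandwidth for GPU memory
        self.host_copy, self.data = self.data, None

    def access(self):               # called whenever the tensor is used
        if self.data is None:
            if self.host_copy is not None:       # restore from the host tensor
                self.data, self.host_copy = self.host_copy, None
            else:                                # recompute along the recorded path
                fn, inputs = self.producer
                self.data = fn(*(t.access() for t in inputs))
        return self.data

a = TensorInfoSketch(2.0)
b = TensorInfoSketch(3.0)
c = TensorInfoSketch(5.0, producer=(lambda x, y: x + y, [a, b]))
c.drop();     print(c.access())   # 5.0, recomputed from a and b
b.swap_out(); print(b.access())   # 3.0, copied back from host memory
```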
3.3 Tensor execution logic with DTR

The figure above shows the pseudocode of the DTR logic. apply_on_physical_tensor (marked in yellow) is all that ApplyOp had to do before DTR was introduced. With DTR in action, the input tensors of ApplyOp may not all be in GPU memory at the moment of execution. First, MegEngine marks these input tensors. Then, AutoEvict is called to keep GPU memory occupancy under the threshold. If any input tensor has been evicted, Regenerate is called to restore it, which recovers the tensor through a series of recursive calls to ApplyOp along its computational path. Finally, apply_on_physical_tensor is executed.
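The same control flow can be rendered as a small runnable sketch. The function names mirror the pseudocode (AutoEvict, Regenerate, apply_on_physical_tensor), while the tensor class and the memory accounting are simplified stand-ins for MegEngine's internals.

```python
# Runnable sketch of ApplyOp with DTR: pin the inputs, evict until the memory
# budget is respected, regenerate any evicted input, then run the operator.
MEMORY_BUDGET = 3                    # pretend the GPU can hold at most 3 tensors

class T:                             # minimal stand-in: a value plus its computational path
    def __init__(self, data=None, producer=None):
        self.data, self.producer = data, producer

def auto_evict(resident, pinned, need=0):
    """Evict unpinned tensors until there is room for `need` more tensors."""
    while len(resident) + need > MEMORY_BUDGET:
        victim = next(t for t in resident if t not in pinned)
        victim.data = None
        resident.remove(victim)

def regenerate(t, resident, pinned):
    """Recursively recompute an evicted tensor along its recorded path."""
    fn, inputs = t.producer
    apply_op(fn, inputs, t, resident, pinned)

def apply_op(fn, inputs, output, resident, pinned=frozenset()):
    pinned = set(pinned) | set(inputs)            # inputs must not be evicted now
    auto_evict(resident, pinned, need=1)          # AutoEvict: make room for the output
    for t in inputs:
        if t.data is None:                        # an input was evicted earlier
            regenerate(t, resident, pinned)       # Regenerate
    output.data = fn(*(t.data for t in inputs))   # apply_on_physical_tensor
    resident.append(output)

resident = []
a, b = T(2.0), T(3.0)
resident += [a, b]
c = T(producer=(lambda x, y: x + y, [a, b]))
apply_op(*c.producer, c, resident)
d = T(producer=(lambda x, y: x * y, [a, b]))
apply_op(*d.producer, d, resident)                # evicts c to make room for d
regenerate(c, resident, pinned=set())             # bringing c back evicts d in turn
print(c.data, d.data)                             # 5.0 None
```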
3.4 Deletion of tensor
Before DTR is applied, MegEngine decides whether a tensor should be deleted solely by its reference count: when the reference count drops to zero, the interpreter is notified to perform a delete. However, if we reconsider this deletion mechanism in a DTR-enabled context, we realize that deleting a tensor can hurt the memory optimization scheme in some cases, one of which is described below:
This figure shows a part of a computational graph. On the left, an input tensor of 9MB is passed into a Convolution operator, which produces a 25MB output tensor. That tensor is then passed to the next operator, an Elemwise, which generates another 25MB tensor as its output. The computation carries on to the right-hand side, involving five tensors in total.

The two Elemwise operators in such a combination are not involved in gradient calculation, which means their input tensors (marked in red) will not be used during backpropagation. Therefore, the two red tensors would normally be deleted immediately after the forward calculation is completed. This leaves us no choice but to keep the three green tensors in GPU memory, because an operator's output can never be recovered once both its input and output tensors are lost. With the DTR strategy in play, we have another option besides deleting the two red tensors: “fake deletion”, which keeps the tensorInfo of the two red tensors while releasing the GPU memory allocated to them. In this way, we can release all four 25MB tensors, keeping only the 9MB tensor in GPU memory, and still be able to restore this part of the computational graph when needed.

The figure above shows the pseudocode for deleting a tensor with DTR enabled in MegEngine. When the interpreter is about to delete a tensor, it calls Free on the tensor's tensorInfo and then performs either a real or a fake deletion, depending on whether the program is currently in the forward calculation. The implementation of fake deletion is simple: mark the tensor as deleted and release the GPU memory managed by its tensorInfo. Real deletion is a little more involved: first, update the ref_cnt of the input tensors that produced this tensor; then call RemoveDep to check all the tensors that depend on this tensor as input. If any of those dependent tensors is not in GPU memory, Regenerate must be called to restore it before the tensor it depends on is deleted.
After all the maintenance above is completed, the tensorInfo corresponding to the tensor can be safely released. Then the computational paths of the deleted tensor's inputs are checked recursively: any input whose ref_cnt has reached 0 and which has been marked for deletion can undergo real deletion as well.
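The deletion flow can be sketched in Python roughly as follows. free, the RemoveDep-style regeneration of dependants, and the recursive cleanup follow the description above, with the actual GPU memory and C++ bookkeeping replaced by simple attributes.

```python
# Sketch of deleting a tensor with DTR: a fake deletion keeps the bookkeeping
# so the value can be recomputed later; a real deletion regenerates dependants
# first and then recursively cleans up inputs that were kept only for this tensor.
class Info:
    def __init__(self, name, producer=None):
        self.name = name
        self.producer = producer     # (op_name, input_infos) or None for external data
        self.data = "resident"       # None means released from GPU memory
        self.users = []              # computational paths that take this tensor as input
        self.ref_cnt = 0
        self.deleted = False         # set by a fake deletion

def free(t, in_forward):
    if in_forward:                   # fake deletion during the forward pass
        t.deleted = True             # keep the tensorInfo ...
        t.data = None                # ... but release the GPU memory now
        return
    # real deletion
    if t.producer is not None:
        for src in t.producer[1]:
            src.ref_cnt -= 1
    for user in t.users:             # RemoveDep: dependants may still need t to regenerate
        for out in user["outputs"]:
            if out.data is None and not out.deleted:
                out.data = "regenerated"          # Regenerate before t disappears
    t.data, t.users = None, []       # the tensorInfo itself can now be released
    if t.producer is not None:       # recursively clean up inputs kept only for t
        for src in t.producer[1]:
            if src.ref_cnt == 0 and src.deleted:
                free(src, in_forward=False)

x = Info("x")                                         # external input
y = Info("y", producer=("conv", [x])); x.ref_cnt += 1
x.users.append({"op": "conv", "outputs": [y]})
free(x, in_forward=True)      # forward pass: fake deletion, x can still be recovered
y.data = None                 # later, y happens to have been evicted by DTR
free(x, in_forward=False)     # real deletion: y is regenerated first
print(y.data)                 # "regenerated"
```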
3.5 Comparison of training time

The following figure shows a comparison of different GPU memory optimizations within MegEngine. We select ResNet-50 and ShuffleNet as the candidate networks for the experiment. The baseline is the result without any memory optimization in dynamic computational graph mode. It can be seen that the maximum batch size with DTR is larger than that of sublinear optimization in the static computational graph, while the training time with DTR is the same as that of sublinear optimization at the same batch size.


ResNet-50 and ShuffleNet are models with relatively fixed structures (i.e., fixed execution paths within the computational graph). To fully exploit the dynamic computational graph in MegEngine, a network such as Single Path One-Shot (SPOS) is a better example. SPOS contains a great number of computational paths, of which only one is selected and updated in each training iteration. In other words, the branch of instructions that gets executed may differ from iteration to iteration. In the figure below, the performance of DTR optimization on SPOS is compared with that of the baseline. With DTR enabled, MegEngine pushes the maximum batch size from 100 up to 250 in both the single-GPU and the eight-GPU scenario.

3.6 Fragmentation problem and solution
In addition to all the benefits that DTR brings, we observed that as GPU memory usage rises to a high level, tensors are released and restored more and more frequently, which results in a serious fragmentation problem. Suppose a total of 1GB of GPU memory is available, but in the form of many scattered small blocks; a tensor of approximately 1GB still cannot be loaded into GPU memory. Although defragmentation helps, it sometimes has a significant impact on overall performance.
To solve this problem, we propose three possible ways of optimization:
- Parameter in-place update
If a frequently updated tensor is very large, we should update its value in place in GPU memory rather than allocating a new tensor. In MegEngine, this can be done by setting the environment variable INPLACE_UPDATE.
- Improved evaluation strategy in DTR
We make a small improvement by introducing fragmentation-related information into the heuristic evaluation function of DTR. We prefer to evict tensors that are not only large themselves but also have large free blocks adjacent to them on both sides.

In addition, we introduce a penalty on the number of times the same tensor has been recomputed, and use four other types of metadata (listed in the figure above) as parameters of the evaluation function. Hyper-parameters allow the heuristic to focus on different aspects (an illustrative sketch follows below).
- Using static planning
If the execution paths of a network are always the same in each iteration, we can treat the computational graph of such a network as a static one. Then all the effective methods of static memory allocation can be applied. It will not only avoid the necessity of defragmentation but also significantly reduce the overall usage of GPU memory, making it possible to support larger batch sizes.
Take ResNet-50 as an example: at a batch size of 400, the peak GPU memory usage in dynamic mode is 9595MB, still about 10% more than the 8549MB of static mode. In addition, MegEngine can guarantee that the fragmentation problem does not occur in the static computational graph.
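As a purely illustrative sketch of the improved evaluation strategy (the article does not give the exact formula or the four extra metadata fields, so the terms and weights below are hypothetical), the DTR score can be extended so that adjacent free memory makes a tensor more attractive to evict, while repeated recomputation makes it less so:

```python
# Hypothetical extension of the eviction score: evicting a tensor whose
# neighbours in the memory pool are already free reclaims a larger contiguous
# block, while tensors that have already been recomputed many times are penalized.
from collections import namedtuple

Cand = namedtuple("Cand", "size compute_time last_access free_before free_after recompute_count")

def extended_score(t, now, alpha=1.0, beta=1.0):
    staleness = now - t.last_access
    reclaimable = t.size + alpha * (t.free_before + t.free_after)  # contiguous block freed
    penalty = 1.0 + beta * t.recompute_count                       # discourage recompute loops
    return (t.compute_time * penalty) / (reclaimable * staleness)  # smaller = evict first

t = Cand(size=25 << 20, compute_time=1e-3, last_access=10.0,
         free_before=4 << 20, free_after=8 << 20, recompute_count=2)
print(extended_score(t, now=20.0))
```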

4. Future work
Based on what we have learned in this work, using static planning as much as possible will remain a focus of our future research on memory optimization in the dynamic computational graph. Flexible as network structures can be in dynamic mode, fragmentation takes its toll on the overall performance of graph execution. In static mode, memory allocation at compile time, as well as other graph optimization techniques, benefits both performance and memory usage.
Ideally, we hope to come up with a scheme of GPU memory optimization that is applicable to both static and dynamic computational graphs, as shown in the following figure:

Both the sublinear optimization for the static computational graph and DTR for the dynamic computational graph can be viewed as a sequence-to-sequence transformation of the execution sequence, adding Drop and Recompute instructions to the original sequence. The difference is that in static mode, we compute the optimal plan of releasing and recomputing after the entire sequence is obtained, whereas in dynamic mode, we insert Drop and Recompute while the execution sequence is still being interpreted. For both types of execution (interpretive execution in dynamic mode and compiled execution in static mode), a Profile records the runtime information of each operator, such as how long each tensor stays in GPU memory. The user can then adjust the computing sequence according to the Profile, which is easier to comprehend than the underlying execution logic. Furthermore, the user could feed the Profile back as input to customize strategies for different models.
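As a small illustration of this view (the instruction names are ours for exposition, not a MegEngine API), the execution sequence of the earlier a/b/c/d walk-through could be rewritten as follows:

```python
# Illustrative only: memory optimization seen as a rewrite of the execution sequence.
original = [
    "c = add(a, b)",
    "d = mul(a, b)",
    "print(c)",
    "print(d)",
]
transformed = [
    "c = add(a, b)",
    "Drop(c)",          # free c before d is computed
    "d = mul(a, b)",
    "Drop(d)",          # free d before c is needed again
    "Recompute(c)",     # c = add(a, b)
    "print(c)",
    "Drop(c)",
    "Recompute(d)",     # d = mul(a, b)
    "print(d)",
]
```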
References:
[1] Kirisame M, Lyubomirsky S, Haan A, et al. Dynamic Tensor Rematerialization. arXiv preprint arXiv:2006.09616, 2020.