TensorFlow, PyTorch or MXNet? A comprehensive evaluation on NLP & CV tasks with Titan RTX

1. Introduction

There is no doubt that GPUs play a significant role for machine learning practitioners, particularly in deep learning, which demands massive parallel computation power. Thanks to the CUDA architecture [1] developed by NVIDIA, developers can exploit GPUs’ parallel computing power to perform general-purpose computation with little extra effort. Since CUDA was first released in early 2007, NVIDIA has been changing the landscape of the GPU market and of GPU-driven applications such as deep learning.

After NVIDIA announced the latest Turing architecture and released the GeForce 20 series in fall 2018, the Titan RTX finally arrived at the end of 2018. Table 1.1 presents the major differences between the 20 series GPUs and the representative 10 series GPU, the 1080 Ti. In addition to upgrades in transistor count, CUDA cores, memory capacity, and memory bandwidth, the two primary new components are the Tensor Cores and the ray tracing (RT) cores. Tensor Cores enable the Titan RTX to perform high-speed floating-point processing and massive matrix operations, and they also power deep learning super-sampling (DLSS), which replaces conventional anti-aliasing. The RT cores are used to generate reflections and shadows.

Screenshot 2019-04-23 13.51.39.png — **Table 1.1: Specification differences between NVIDIA Titan RTX and other mainstream NVIDIA GPUs.**

Powerful GPUs have helped the whole machine learning and deep learning community flourish. Popular frameworks with GPU support have been released and iteratively updated; TensorFlow, PyTorch, and MXNet are the three most widely used. Though these frameworks are designed to be general machine learning platforms, the inherent differences in their designs, architectures, and implementations lead to potential variance in machine learning performance on GPUs. For example, in VGG16 training TensorFlow is 49% faster than MXNet, and PyTorch is 24% faster than MXNet. This variance is significant for ML practitioners, who have to consider time and monetary cost when choosing the appropriate framework for a specific type of GPU.

Our objective is to evaluate the performance achieved by TensorFlow, PyTorch, and MXNet on the Titan RTX. By running both the training and inference phases of different standard models with mixed precision and single precision, we not only collect training and inference progress but also record operating system (OS) metrics over time, such as GPU utilization and memory utilization. These OS-level metrics further help distinguish how well each framework exploits the underlying hardware.

Experiments on our testbed with the Titan RTX have shown that TensorFlow and PyTorch achieve slightly faster training speed than MXNet on relatively large datasets such as ImageNet and COCO2017, but on rather small images, MXNet obtains the best training performance. This outcome is quite interesting and may indicate that TensorFlow and PyTorch have great potential in optimizing data-intensive tasks, while MXNet is good at general machine learning processing.

Another interesting point is that mixed precision performed well in deep learning: in all of our selected experiments, it improved training speed without losing accuracy. This suggests that training with mixed precision has the potential to become a new standard practice for deep learning tasks.

2. Background

2.1 RTX Series GPUs

The GPU we received from NVIDIA is a Titan RTX, based on the Turing architecture. Compared to existing PC GPUs, the Titan RTX is the fastest graphics card ever built for PC users. The high computation efficiency of GPUs drives developers to include GPU support when designing distributed machine learning frameworks. Initially released in late 2015 by the Google Brain team, TensorFlow is Google Brain’s second-generation machine learning framework. PyTorch was first released in fall 2016 and is developed primarily by Facebook. With a pure Pythonic development experience, PyTorch is warmly welcomed by the Python community. Apache MXNet originated in academia [2] and is now an Apache incubating project. Amazon has chosen MXNet as its deep learning framework on AWS. These three machine learning frameworks have been widely applied in both industry and academia. Our evaluation is based on these three frameworks in order to cover most machine learning practitioners.

There is a rich literature in the field of GPU evaluations. Most evaluation reports focus on the performance of different GPUs with standard machine learning models. TensorFlow has built-in benchmarks for performance testing that cover two Tesla-series GPUs, the NVIDIA P100 and the NVIDIA K80 [3]. MLPerf (https://mlperf.org/results/) presents a series of systematic evaluations on platforms including Google TPUs, Intel CPUs, and NVIDIA GPUs. As of this writing, MLPerf has not included the latest NVIDIA GPUs such as the Titan RTX. Lambda, the AI infrastructure company, has published a blog post on 2080 Ti TensorFlow GPU benchmarks (https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v100-vs-titan-v-vs-1080-ti-benchmark/). That post runs TensorFlow models on GPUs including the NVIDIA 2080 Ti, Tesla V100, 1080 Ti, and Titan V. Unlike existing evaluations, our objective is to evaluate how the mainstream machine learning frameworks exploit the latest Titan RTX for machine learning training and inference. Inside the Titan RTX, Turing Tensor Cores provide multiple precisions for training and inference, from single precision FP32 to half precision FP16 and mixed precision, taking performance a step further.

2.2 Mixed Precision

We can obtain a better model by increasing the size of a neural network, but doing so inevitably increases the memory and compute requirements of training. Mixed precision was therefore introduced as a methodology that enables training deep neural networks with half-precision floating-point numbers without changing model accuracy or modifying hyperparameters.

When applying mixed precision to training, the activations, weights, and gradients are stored in FP16, reducing memory pressure for storage and matrix operations. Master weights are maintained in FP32, and updated with the FP16 result of the forward and backward pass on each layer.

image (28).png — **Figure 2.2.1: Process of training with mixed precision.**
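The role of the FP32 master weights can be illustrated with a toy NumPy sketch. It is not taken from any of the evaluated frameworks: the function name, learning rate, gradient value, and step count are hypothetical choices made only to show that a small update can vanish when a weight is stored and updated purely in FP16, while the same update accumulates in an FP32 master copy.

```python
import numpy as np

def simulate_updates(num_steps=100, lr=1e-4, grad=-1.0):
    """Apply the same small update to an FP16-only weight and an FP32 master weight.

    All values are hypothetical; this is an illustration, not a training loop.
    """
    grad_fp16 = np.float16(grad)      # gradient as produced by an FP16 backward pass
    w_fp16_only = np.float16(1.0)     # weight stored and updated in FP16 only
    w_master = np.float32(1.0)        # FP32 master copy of the same weight

    for _ in range(num_steps):
        # FP16-only update: the step (about 1e-4) is smaller than half the
        # spacing between FP16 values near 1.0, so it is rounded away.
        w_fp16_only = np.float16(w_fp16_only - np.float16(lr) * grad_fp16)
        # Mixed precision update: the FP16 gradient is applied to the FP32 master.
        w_master = np.float32(w_master - np.float32(lr) * np.float32(grad_fp16))

    # The FP16 working copy used by the next forward pass is cast from the master.
    w_working = np.float16(w_master)
    return w_fp16_only, w_master, w_working

if __name__ == "__main__":
    fp16_only, master, working = simulate_updates()
    print("FP16-only weight:", fp16_only)
    print("FP32 master weight:", master)
    print("FP16 working copy:", working)
```

In this sketch the FP16-only weight never moves away from its starting value, whereas the master weight reflects all of the accumulated updates and the working copy cast from it has changed accordingly.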

We use the experiments with FP32 precision as our baseline, i.e., activations, weights, and gradients are stored in single precision and all operations are carried out in single precision. Some selected tasks are also run with mixed precision for further comparative analysis.

3. Evaluation

In this section, we present the configuration of our testbed, a desktop with off-the-shelf components. We also describe the benchmark models and the collected metrics. We document as much detail as possible to ensure our evaluation is reproducible.

3.1 Testbed

We installed the Titan RTX on a testbed computer that is representative of most mainstream PCs and affordable for most of our readers. Performance on higher-end machines with SSDs and DDR4 memory can be roughly inferred from our results. Given NVIDIA's recent campaigns, RTX is best known for gaming and entertainment, so our readers are very likely to simply add an RTX card to the home workstation they already use for work, study, and gaming. One of the goals of this review is to give readers a reference for the performance they can expect in this scenario.

The specifications are shown in Table 3.1.1.

Screenshot 2019-04-23 13.55.15.png — **Table 3.1.1: Testbed configurations.**

For consistency, we pull the latest TensorFlow, PyTorch, and MXNet images from NVIDIA GPU Cloud (NGC). This also simplifies the setup of the evaluation environment. The framework versions and the driver we install are shown in Table 3.1.2. For all frameworks, we use FP32 precision by default. We compare the performance of mixed precision with single precision in Section 6.

Screenshot 2019-04-23 13.56.39.png — **Table 3.1.2: Versions of frameworks and drivers.**

3.2 Benchmarks and Metrics

Third parties such as MLPerf (https://mlperf.org) have published detailed training performance results for multiple GPUs (https://mlperf.org/results/). This report does not cover MLPerf results; instead, it covers only a series of experiments on the Titan RTX GPU. The experiments span various Computer Vision and Natural Language Processing tasks, and we explore their inference and training speed at various scales and precisions. For ML practitioners, this technical report gives an intuitive look at Titan RTX performance on frequently used models, so you can better compare devices and choose the ideal one.

To give the audience an intuitive impression of the results, we follow the official setting of each network, e.g. batch_size 128 for VGG. Faster-RCNN is the exception. It contains two networks, and its RPN branch generates multiple proposals per image (256 in our setting), so the effective batch is not as small as it seems. Considering the implementation differences between frameworks, batch_size 1 is the most stable choice and probably the most straightforward for our audience to replicate, so we use this value.

We extend the evaluation on the Titan RTX GPU to three popular frameworks (TensorFlow, PyTorch, and MXNet) and six datasets (COCO2017, CIFAR-10, ImageNet 2012, WMT16 English-German, MovieLens-1M, and text8). For CV, we choose two classification tasks on datasets of different scales and one detection task: ResNet-50 on CIFAR-10 classification, VGG16 on ImageNet 2012 classification, and Faster-RCNN on COCO2017 detection. For NLP, we choose the most dominant models for three popular tasks: Google Neural Machine Translation for machine translation, Neural Collaborative Filtering for recommender systems, and Word2Vec for word embeddings. Each experiment follows the official settings from its original repository.

As evaluation metrics, we present GPU utilization percentage, GPU memory utilization percentage, GPU memory used, CPU utilization percentage, CPU memory utilization percentage, CPU memory used, and training/inference speed. Together, these give a comprehensive impression of each task.
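Training and inference speed can be derived by timing a fixed number of steps. The helper below is a hypothetical sketch rather than the script used in this report: `measure_throughput`, its parameters, and the injectable `clock` are illustrative names. For real GPU workloads, the step function would also need to wait for the device to finish its work before the timer is read.

```python
import time

def measure_throughput(step_fn, batch_size, num_steps, warmup_steps=0,
                       clock=time.perf_counter):
    """Return (steps_per_sec, samples_per_sec) for a callable that runs one step.

    step_fn is assumed to process exactly one batch of batch_size samples.
    """
    # Warm-up steps are executed but excluded from the timing.
    for _ in range(warmup_steps):
        step_fn()

    start = clock()
    for _ in range(num_steps):
        step_fn()
    elapsed = clock() - start

    steps_per_sec = num_steps / elapsed
    return steps_per_sec, steps_per_sec * batch_size

if __name__ == "__main__":
    # Hypothetical stand-in for a training step: sleep for 10 ms.
    steps, samples = measure_throughput(lambda: time.sleep(0.01),
                                        batch_size=128, num_steps=20,
                                        warmup_steps=2)
    print(f"{steps:.1f} steps/sec, {samples:.1f} samples/sec")
```

Reporting both values makes it easy to move between the steps/sec figures used for the NLP tasks and the images/sec figures used for the CV tasks.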

These utilization metrics are presented as average values. The data were recorded at an interval of 5 seconds, and the average utilization was calculated after each experiment from the recorded data. Last but not least, since mixed precision is newly supported by the Titan RTX, we evaluated different models under mixed precision and single precision (FP32). The differences in training and inference between mixed precision and single precision are also presented.
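The sampling and averaging procedure can be sketched as follows. The report does not state which tool was used to record the metrics, so this sketch assumes the GPU figures are read with `nvidia-smi`; the function names and the dictionary keys are hypothetical. Only the GPU side is shown.

```python
import subprocess
import time
from statistics import mean

# Fields requested from nvidia-smi: GPU utilization (%), memory utilization (%),
# and memory used (MiB).
QUERY = "utilization.gpu,utilization.memory,memory.used"

def parse_sample(line):
    """Parse one CSV line such as '97, 45, 10234' into a dict of floats."""
    gpu_util, mem_util, mem_used = (float(field) for field in line.split(","))
    return {"gpu_util": gpu_util, "mem_util": mem_util, "mem_used_mib": mem_used}

def read_gpu_sample():
    """Query the first GPU once (requires nvidia-smi on the PATH)."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_sample(out.strip().splitlines()[0])

def average_samples(samples):
    """Average every metric over the recorded samples."""
    return {key: mean(s[key] for s in samples) for key in samples[0]}

def monitor(duration_s, interval_s=5.0):
    """Record a sample every interval_s seconds, then return the averages."""
    samples = []
    deadline = time.time() + duration_s
    while True:
        samples.append(read_gpu_sample())
        if time.time() >= deadline:
            break
        time.sleep(interval_s)
    return average_samples(samples)
```

Calling `monitor(duration_s=600)` in a separate process while an experiment runs would yield one averaged value per metric, matching the 5-second interval described above.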

Screenshot 2019-04-23 13.58.37.png — **Table 3.2.1: Benchmarks in our evaluation.**

4. Results on CV tasks

In this section, we run all CV tasks with single precision. Following the setup steps and experimental settings described above, we present the detailed results of the CV tasks below.

4.1. Experiment 1: ResNet-50 on CIFAR-10

Settings:
Experiment: ResNet-50 Inference
Framework: NGC TensorFlow 18.12/NGC PyTorch 19.01/NGC MXNet 19.01
Batch size: 64 (inference)

**Table 4.1.1: ResNet-50 inference performance and resource utilization with single precision.**

Settings:
Experiment: ResNet-50 Training
Framework: NGC TensorFlow 18.12/NGC PyTorch 19.01/NGC MXNet 19.01
Batch size: 128 (training)

Screenshot 2019-04-23 14.05.39.png — **Table 4.1.2: ResNet-50 training performance and resource utilization with single precision.**

4.2. Experiment 2: VGG16 on ImageNet

Settings:
Experiment: VGG16 Inference
Framework: NGC TensorFlow 18.12/NGC PyTorch 19.01/NGC MXNet 19.01
Batch size: 64 (inference)

Screenshot 2019-04-23 14.08.01.png — **Table 4.2.1: VGG16 inference performance and resource utilization.**

Settings:
Experiment: VGG-16 Training
Framework: NGC TensorFlow 18.12/NGC PyTorch 19.01/NGC MXNet 19.01
Batch size: 128 (training)

Screenshot 2019-04-23 14.11.56.png — **Table 4.2.2: VGG-16 training performance and resource utilization with single precision.**

4.3. Experiment 3: Faster-RCNN on COCO2017

The Faster-RCNN experiment uses a batch size of 1 because of the specification of this algorithm. The batch size could be increased to 4 with some modification, but we decided to stay with the original implementation. All other experiments use a common batch size of either 64 or 128.

Settings:
Experiment: Faster-RCNN Inference
Framework: NGC TensorFlow 18.12/NGC PyTorch 19.01/NGC MXNet 19.01
Batch size: 1 (inference)

Screenshot 2019-04-23 14.13.27.png — **Table 4.3.1: Faster-RCNN inference performance and resource utilization.**

Settings:
Experiment: Faster-RCNN Training
Framework: NGC TensorFlow 18.12/NGC PyTorch 19.01/NGC MXNet 19.01
Batch size: 1 (training)

Screenshot 2019-04-23 14.15.15.png — **Table 4.3.2: Faster-RCNN training performance and resource utilization with single precision.**

4.4. Result analysis

We visualize the evaluation data to give an intuitive comparison between frameworks and tasks, and a few interesting insights emerge from our observations.

Figure 4.4.1 and Figure 4.4.2 present the inference speed and training speed of different CV models:

TensorFlow achieves the best inference speed on ResNet-50, MXNet is fastest in VGG16 inference, and PyTorch is fastest on Faster-RCNN.

MXNet has the fastest training speed on ResNet-50, TensorFlow is fastest on VGG-16, and PyTorch is the fastest on Faster-RCNN.

To summarize GPU/CPU utilization and memory utilizations, we plot different charts to compare across frameworks and experiments.

**Figure 4.4.3: GPU utilization of inference.**

All three frameworks reach full GPU utilization on VGG-16. The PyTorch version of Faster-RCNN has the lowest GPU utilization due to its code optimization. On average, TensorFlow has the highest GPU utilization across all inference tasks.

**Figure 4.4.4: GPU memory utilization of inference.**

MXNet has the lowest GPU memory utilization in ResNet-50 inference, TensorFlow has the lowest in VGG16 inference, and PyTorch has the lowest in Faster-RCNN inference. On average, TensorFlow and PyTorch have similar memory utilization, while MXNet has the lowest in inference.

**Figure 4.4.5: CPU utilization of inference.**

On average, TensorFlow has the lowest CPU utilization, while PyTorch has the highest in inference tasks.

**Figure 4.4.6: CPU memory utilization of inference.**

On average, TensorFlow takes the most CPU memory in inference tasks, while PyTorch and MXNet consume similar amounts of memory.

**Figure 4.4.7: GPU utilization at training.**

During training, PyTorch utilizes the most GPU resources, while TensorFlow consumes the least.

**Figure 4.4.8: GPU memory utilization at training.**

During training, PyTorch consumes the most GPU memory resources, while TensorFlow consumes the least.

**Figure 4.4.10: Memory utilization at training.**

In training tasks, MXNet consumes the least CPU resources while TensorFlow consumes the most on average.

For training, PyTorch consumes the most CPU memory, while MXNet and TensorFlow have similar memory utilization on average.

Note that all experiments use open-source code on GitHub. Some of this code may contain specific performance optimizations, which might lead to differences in the final results.

We found a few interesting observations in the above charts. When training ResNet-50, MXNet is the fastest of the three frameworks. On the VGG-16 tasks, all three frameworks fully utilize the GPU, but TensorFlow achieves the fastest training speed while MXNet is the slowest. In the detection experiments, the PyTorch version of Faster-RCNN significantly outperforms the other two frameworks (though the PyTorch code may include some extra optimization). These findings suggest that, even on the same computing device, different types of tasks and different frameworks can lead to performance fluctuations, as can your dataset and code optimization methods.
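Relative figures such as the 49% quoted in the introduction are conventionally computed as the throughput difference divided by the slower throughput. The short sketch below shows that arithmetic; the function name is hypothetical, and the two throughput values in the usage example are placeholders rather than measurements from this report.

```python
def relative_speedup_percent(faster_throughput, slower_throughput):
    """Return how much faster the first throughput is, as a percentage of the second."""
    if slower_throughput <= 0:
        raise ValueError("slower_throughput must be positive")
    return (faster_throughput - slower_throughput) / slower_throughput * 100.0

if __name__ == "__main__":
    # Placeholder values in images/sec, chosen only to illustrate the formula.
    framework_a = 149.0
    framework_b = 100.0
    print(f"{relative_speedup_percent(framework_a, framework_b):.0f}% faster")
```

The same function applies to any pair of throughputs expressed in the same unit, for example steps/sec for the NLP tasks.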

5. Results on NLP tasks

In this section, we run all NLP tasks with single precision. Following the setup steps and experimental settings described above, we present the detailed results of the NLP tasks below.

5.1 Experiment 4: Google Neural Machine Translation

Settings:
Experiment: Google Neural Machine Translation Training
Framework: NGC TensorFlow 19.02/NGC PyTorch 19.02/NGC MXNet 19.02
Batch size: 128 (training)
Dataset: WMT16 English-German

Screenshot 2019-04-23 14.17.41.png — **Table 5.1.1: Google Neural Machine Translation training performance and resource utilization with single precision.**

Screenshot 2019-04-23 14.18.41.png — **Table 5.1.2: Google Neural Machine Translation training performance and resource utilization with mixed precision.**

Settings:
Experiment: Neural Machine Translation Inference
Framework: NGC TensorFlow 19.02/NGC PyTorch 19.02/NGC MXNet 19.02
Batch size: 128 (Inference)
Dataset: newstest2014

**Table 5.1.3: Neural Machine Translation inference performance and resource utilization.**

5.2 Experiment 5: Recommendation System – Neural Collaborative Filtering (NCF)

Settings:
Experiment: NCF Training
Framework: NGC TensorFlow 19.02/NGC PyTorch 19.02/NGC MXNet 19.02
Batch size: 256 (training)
Dataset: MovieLens-1M

**Table 5.2.1: Neural Collaborative Filtering training performance and resource utilization with single precision.**

**Table 5.2.2: Neural Collaborative Filtering training performance and resource utilization with mixed precision.**

Settings:
Experiment: NCF Inference
Framework: NGC TensorFlow 19.02/NGC PyTorch 19.02/NGC MXNet 19.02
Batch size: 100 (inference)
Dataset: MovieLens-1M test

Screenshot 2019-04-23 14.43.30.png — **Table 5.2.3: Neural Collaborative Filtering inference performance and resource utilization.**

5.3 Experiment 6: Word Embedding – Word2Vec

Settings:
Experiment: Word2Vec: Skip-Gram Modelling
Framework: NGC TensorFlow 19.02/NGC PyTorch 19.02/NGC MXNet 19.02
Batch size: 256 (training)
Dataset: text8

**Table 5.3.1: Word2Vec training performance and resource utilization with single precision.**

5.4 Result analysis

**Figure 5.4.1: Training speed with single precision of different NLP models (steps/sec).**

MXNet achieves the best training speed on the GNMT task, PyTorch is the fastest in NCF training, and TensorFlow is the fastest in Word2Vec training.

**Figure 5.4.2: Inference speed of different frameworks for Neural Machine Translation.**

The inference speeds of TensorFlow and MXNet are approximately the same for both the GNMT and NCF tasks, whereas PyTorch achieves much better performance.

To summarize GPU/CPU utilization and Memory utilization, we plot different charts to compare across frameworks and experiments.

**Figure 5.4.3: GPU utilization at training.**

The GPU utilization of TensorFlow in Word2Vec training is far higher than that of the others. PyTorch has the highest GPU utilization in GNMT training and the lowest in NCF training.

**Figure 5.4.4: GPU utilization of inference.**

For the GNMT task, PyTorch has the highest GPU utilization, and its inference speed also outperforms the others. For the NCF task, although there is no significant difference in GPU utilization among the three frameworks, PyTorch is still the better choice when the GPU is the main concern, because it has a higher inference speed.

**Figure 5.4.5: GPU memory utilization time at training.**

MXNet has the highest GPU memory utilization time in GNMT and Word2Vec training, while the values are almost negligible for PyTorch and MXNet in NCF training. Overall, MXNet used the least GPU memory utilization time across all tasks.

**Figure 5.4.6: GPU memory utilization time of inference.**

TensorFlow spends a higher percentage of each sample period with device memory being read or written, whereas PyTorch and MXNet place almost no demand on GPU memory during inference for both the GNMT and NCF tasks, especially NCF (percentages below 0.50% are rounded to 0.00%).

**Figure 5.4.7: CPU utilization at training.**

On average, the CPU utilization was evenly distributed for all frameworks at training steps.

Charts_NLP Inference_004.png — **Figure 5.4.8: CPU Utilization of inference.**

**Figure 5.4.9: Memory utilization at training.**

On average, TensorFlow takes the least memory at training across all tasks, while PyTorch takes the most memory for the NCF and Word2Vec tasks.

**Figure 5.4.10: Memory utilization of inference.**

There is no large difference among the three frameworks.

Every framework exhibits different running performance even when training the same neural network on the same hardware platform, because of the different optimization methods used by vendors. For NMT tasks, which are known to be computationally expensive in both training and translation inference, MXNet achieves the best performance, with lower GPU utilization but higher CPU utilization. For recommendation tasks, there is no noticeable variation in training, but in inference the performance of PyTorch is outstanding. For the Word2Vec task, TensorFlow outperforms the others, but it has higher GPU utilization.

6. Results on Mixed Precision and Single Precision

We compared the performance and efficiency of the three frameworks when performing training and inference with mixed precision and single precision. Our evaluation on the Titan RTX has shown that both training and inference under mixed precision outperform their single-precision counterparts. This observation underlines the value of mixed precision support in GPUs for ML tasks.

6.1. ResNet-50

To evaluate the performance of each framework with mixed precision, as well as the performance gap between mixed precision and single precision, we ran ResNet-50 on the three frameworks with mixed precision and single precision respectively. The ResNet-50 code repository for the three frameworks is provided by NVIDIA (https://github.com/NVIDIA/DeepLearningExamples).

Note that in our evaluation PyTorch did not fully utilize the GPU and achieved the slowest image processing speed among the three frameworks. NVIDIA's PyTorch implementation of ResNet-50 might not be fully optimized. In addition, MXNet ran out of memory with single precision at a batch size of 256, so we switched to a batch size of 208.

Figure 6.1.1 and Figure 6.1.2 present the images processed per second during training and inference respectively. The speed with mixed precision is nearly twice that with single precision, except for PyTorch.

Charts_Mixed Precision-CV_001.png — **Figure 6.1.1: ResNet-50 training speed.**

Charts_Mixed Precision-CV_002.png — **Figure 6.1.2: ResNet-50 inference speed.**

As shown in Figure 6.1.3, although training with mixed precision is faster, it shows lower GPU utilization than single precision. Half-precision computation reduces the computing complexity and relieves the stress on storage.

Charts_Mixed Precision-CV_003.png — **Figure 6.1.3: ResNet-50 GPU Utilization at training.**

Figure 6.1.4 shows the GPU time used by different frameworks when training ResNet-50.

Charts_Mixed Precision-CV_004.png — **Figure 6.1.4: ResNet-50 GPU utilization time at training.**

TensorFlow shows much higher CPU utilization than the other two frameworks; in particular, TensorFlow with mixed precision reaches around 66% CPU utilization in Figure 6.1.5. CPU utilization is otherwise low, since most of the workload is assigned to the GPU.

**Figure 6.1.5: ResNet-50 CPU utilization at training.**

All three frameworks consumed similar amounts of memory according to Figure 6.1.6.

Charts_Mixed Precision-CV_006.png — **Figure 6.1.6: ResNet-50 Memory Utilization at training.**

Similar to the GPU utilization at training in Figure 6.1.3, Figure 6.1.7 shows that the frameworks have lower GPU utilization at inference with mixed precision.

**Figure 6.1.7: ResNet-50 GPU utilization at inference.**

As shown in Figure 6.1.8, inference with single precision has a higher GPU memory utilization time than inference with mixed precision.

Charts_Mixed Precision-CV_008.png — **Figure 6.1.8: Memory utilization time at inference.**

Similar to training in Figure 6.1.5, CPU utilization at inference is also low in Figure 6.1.9.

Charts_Mixed Precision-CV_009.png — **Figure 6.1.9: CPU utilization at inference.**

Figure 6.1.10 shows that inference consumes less memory than training. Though we only have 16GB memory, it is still not the bottleneck for Titan RTX when performing training and inference of ResNet-50.

Charts_Mixed Precision-CV_010.png — **Figure 6.1.10: Memory utilization at inference.**

6.2. NLP Tasks

To evaluate the performance of each framework on mixed precision, as well as the performance gap between mixed precision and single precision, we ran Google Neural Machine Translation (GNMT) on the TensorFlow and PyTorch frameworks with mixed precision and single precision respectively.

Screenshot 2019-04-23 15.48.54.png — **Table 6.2.1: Comparison of mixed precision training and single precision training of GNMT task.**

**Figure 6.2.1: Training speed between mixed precision and FP32 precision of GNMT task (steps/sec).**

Mixed precision achieves better performance than single precision, especially under the PyTorch framework, where the difference is clearly noticeable.

**Figure 6.2.2: GPU utilization between mixed precision and FP32 precision of GNMT task.**

**Figure 6.2.3: GPU memory utilization time between mixed precision and FP32 precision of GNMT task.**

Under the TensorFlow framework, mixed precision has lower GPU utilization and memory utilization time yet runs faster. For PyTorch, although GPU utilization and memory utilization time are higher, performance improves significantly.

**Figure 6.2.4: CPU utilization between mixed precision and FP32 precision of GNMT task.**

**Figure 6.2.5: Memory utilization between mixed precision and FP32 precision of GNMT task.**

For both TensorFlow and PyTorch the differences are minor, with mixed precision showing slightly higher CPU utilization.

**Table 6.2.2: Comparison of mixed precision training and single precision training of NCF task.**

**Figure 6.2.6: Training speed comparison between mixed precision and single precision of NCF task (steps/sec).**

Similar to the performance on GNMT task, the training speed on NCF task is accelerated with mixed precision.

**Figure 6.2.7: GPU utilization comparison between mixed precision and single precision of NCF task.**

Charts_Mixed Precision - NLP_005.png — **Figure 6.2.8: GPU memory utilization time comparison between mixed precision and single precision of NCF task.**

NCF training consumes higher GPU utilization and memory utilization time with mixed precision.

**Figure 6.2.9: CPU utilization comparison between mixed precision and single precision of NCF task.**

**Figure 6.2.10: Memory utilization comparison between mixed precision and single precision of NCF task.**

Single precision has higher CPU utilization and memory utilization than mixed precision.

In conclusion, training models with mixed precision achieves higher speed than training with single precision, without sacrificing model accuracy.

6.3. Summary

For CV models, the half precision supported by the Titan RTX greatly speeds up image processing in both training and inference. In general, half-precision training and inference show lower GPU utilization. For NLP tasks, we have likewise demonstrated that deep learning models can be trained with mixed precision without losing accuracy while accelerating training. Overall, our experiments suggest that half-precision storage is highly recommended as a regularizer during training. We believe that mixed precision can be an important technique that allows us to reduce the cost of arithmetic operations and thus the requirements placed on the GPU.

7. Conclusions

In this report, we have evaluated three mainstream machine learning frameworks on the latest Titan RTX GPU. The evaluation on our representative testbed has shown that the Titan RTX has brought a huge increase in training and inference of CV models and NLP models, particularly with the mixed precision support. We have also observed the performance gaps between frameworks on utilizing GPUs for different models. These performance gaps are typically crucial for machine learning developers when they decide the right combination of machine learning tasks, frameworks, and hardware.

This report reveals only a small corner of the possible combinations of software and hardware. There is ample room for further exploration and evaluation, for example of TensorRT, which the ML team at SAP claims may bring a 45x increase in inference speed on Tesla V100 GPUs compared to CPU-based platforms. We will extend our evaluation to more models, frameworks, and hardware in future work.

8. Acknowledgements

We greatly appreciate that NVIDIA supported us with a Titan RTX GPU without any constraints on our writing. The series of evaluations we performed on the Titan RTX GPU adheres to the principle of being neutral and fair. Finally, many thanks for the support from Synced Global Office and our friend Jack Luo at UofT.

Appendix: Analysts’ Memo

Based on all the results above, the Titan RTX is well prepared for both training and inference on various computer vision (CV) tasks, even with a large batch size. Since the Titan RTX has more GPU memory than the other RTX 20 series GPUs, general training tasks can be placed entirely in its memory, which greatly reduces the time cost compared to multi-card training. Besides, the brand new Turing architecture gives more control over the GPU, which can free up some CPU occupancy, and the powerful Tensor Cores enable faster speed on general CV tasks. As for the experimental results, we found that TensorFlow and PyTorch may perform better on data-intensive CV tasks, while MXNet performs well on general small-dataset training. For resource utilization, PyTorch makes good use of the GPU. For NLP tasks, no single framework outperforms the others. We show that TensorFlow scales worse than the others for some tasks, e.g., Google Neural Machine Translation, which may be because TensorFlow performs gradient aggregation and model updates on the CPU side.

Besides the frameworks' performance on the Titan RTX GPU, let’s compare its hardware features with those of other mainstream GPUs released previously. A quick skim through the specs in Table 1.1 shows that, compared to the other three GeForce series GPUs, the Titan RTX has the most CUDA cores and the largest memory bandwidth and bus width, which leads to the most powerful matrix computation acceleration for deep learning. On these three key parameters, the RTX 2080 Ti is comparatively close to the Titan RTX in configuration, and both use the latest Turing architecture. The GTX 1080 Ti has been a classic GPU in the market, but, being based on the older Pascal architecture, it is surpassed by the RTX 2080 Ti (for comparison details, see https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/ and https://gpu.userbenchmark.com/Compare/Nvidia-RTX-2080-Ti-vs-Nvidia-GTX-1080-Ti/4027). The RTX 2080 Ti, as a GeForce GPU designed for gaming, has relatively limited video memory and less eye-catching key features, so it might not be our first choice as a deep learning device. Hence, for common CV tasks, even though the RTX 2080 Ti can fulfill some of our requirements in memory capacity and model acceleration, we recommend the Titan RTX for its 24GB of GDDR6 memory, which largely removes the need for a multi-card configuration and reduces transmission time between multiple cards. With it, we can painlessly train on a relatively large dataset. For even larger-scale deep learning tasks, we recommend trying NVIDIA Tesla series GPUs in a datacenter rather than the Titan RTX.

Compared to single precision, mixed precision has clear advantages, but it requires hardware support, and most existing models do not provide a mixed precision option for training or deployment. We look forward to ML frameworks implementing mixed precision as a built-in feature when models are constructed with official APIs.
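
To make the trade-off concrete, here is a minimal sketch of two ideas commonly used in mixed precision training: loss scaling and a higher-precision master copy of the weights. It is not taken from any of the frameworks evaluated above. It uses only Python's standard library to emulate half precision, and the helper to_fp16, the constant LOSS_SCALE, and all numeric values are illustrative assumptions rather than measurements from our experiments.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Illustrative values only (assumptions, not results from this article).
# Python floats are FP64 and stand in here for the FP32 side of training.
LOSS_SCALE = 1024.0
tiny_grad = 1e-8       # a gradient smaller than FP16 can represent
master_weight = 1.0    # weight kept in higher precision
update = 1e-4          # learning rate times gradient

# 1) Cast directly to FP16, the small gradient underflows to zero.
naive_grad = to_fp16(tiny_grad)

# 2) Scaling before the FP16 cast preserves it; unscale in higher precision.
recovered_grad = to_fp16(tiny_grad * LOSS_SCALE) / LOSS_SCALE

# 3) A weight stored in FP16 cannot absorb a small update,
#    while the higher-precision master copy can.
fp16_weight_after = to_fp16(to_fp16(master_weight) + update)
master_weight_after = master_weight + update

print("naive FP16 gradient:   ", naive_grad)
print("loss-scaled gradient:  ", recovered_grad)
print("FP16 weight after step:", fp16_weight_after)
print("master weight after:   ", master_weight_after)
```

The sketch suggests why mixed precision usually needs cooperation from the framework: gradients have to be scaled and unscaled at the right points, and the optimizer has to update a higher-precision copy of the weights, which is hard to retrofit onto a model that was written for single precision only.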

References

[1] http://developer.download.nvidia.com/compute/cuda/docs/CUDA_Architecture_Overview.pdf
[2] MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, https://github.com/dmlc/web-data/raw/master/mxnet/paper/mxnet-learningsys.pdf
[3] TensorFlow Benchmark, https://www.tensorflow.org/guide/performance/benchmarks
[4] TensorFlow Train Res50 (/workspace/models/official/resnet)
Cifar10 https://github.com/tensorflow/models/tree/master/official/resnet
[5] TensorFlow Train VGG16
Imagenet https://github.com/tensorflow/models/tree/master/research/slim
[6] TensorFlow Inference Res50/VGG16
Imagenet/cifar10 https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
[7] TensorFlow FasterRCNN COCO https://github.com/tensorpack/tensorpack (python3 xx.py)
[8] PyTorch Train Res50 Cifar10 https://github.com/kuangliu/pytorch-cifar
[9] PyTorch Train VGG16
Imagenet https://github.com/pytorch/examples/tree/master/imagenet
https://github.com/ryujaehun/pytorch-gpu-benchmark/blob/master/benchmark_models.py
https://gist.github.com/tdeboissiere/12a5e814e9eff3d2cb2c29ff100a09f0
[10] PyTorch Inference Res50
Cifar10 https://github.com/kuangliu/pytorch-cifar
[11] PyTorch Inference VGG16
Imagenet https://github.com/pytorch/examples/tree/master/imagenet
[12] PyTorch FasterRCNN
https://github.com/ruotianluo/pytorch-faster-rcnn
[13] MXNet Train Res50
https://github.com/tornadomeet/ResNet
Cifar10 https://github.com/apache/incubator-mxnet/tree/master/example/image-classification
https://mxnet.incubator.apache.org/api/python/gluon/model_zoo.html
[14] MXNet Train VGG16
Imagenet https://mxnet.incubator.apache.org/api/python/gluon/model_zoo.html
[15] MXNet Inference Res50
https://www.leadergpu.com/articles/432-mxnet-benchmark
https://github.com/apache/incubator-mxnet/tree/master/example/image-classification
[16] MXNet Inference VGG16
https://mxnet.apache.org/model_zoo/index.html
[17] MXNet FasterRCNN
https://github.com/ijkguo/mx-rcnn
[18] https://www.tomshardware.com/news/nvidia-titan-rtx-specs-pricing,38184.html
[19] https://www.hardwarezone.com.sg/feature-nvidia-geforce-rtx-2080-and-2080-ti-review-guess-who-has-fastest-cards-again/test-setup-gaming-performance

Analyst: Angulia Chao, Hecate He | Producer: H4O, Mos, Chain | Produced by Synced Lab

