A100 vs. V100 for ML Training

Hello NVIDIA Forum!

Great to be here - I hope this post is in the right place. Happy to move if not.

We are comparing the performance of the A100 vs. the V100 and can't achieve any significant boost. Please see the numbers below:

  • LambdaLabs benchmarks (see A100 vs V100 Deep Learning Benchmarks | Lambda):

    • 4 x A100 is about 55% faster than 4 x V100, when training a conv net on PyTorch, with mixed precision.
    • 4 x A100 is about 170% faster than 4 x V100, when training a language model on PyTorch, with mixed precision.
    • 1 x A100 is about 60% faster than 1 x V100, when training a conv net on PyTorch, with mixed precision.
  • NVIDIA benchmarks (see https://developer.nvidia.com/deep-learning-performance-training-inference):

    • 1 x A100 is around 60-70% faster than 1 x V100, when training a conv net on PyTorch, with mixed precision.
    • 1 x A100 is around 100-120% faster than 1 x V100, when training a conv net on TensorFlow, with mixed precision.
    • 8 x A100 is around 70-80% faster than 8 x V100, when training a conv net on PyTorch, with mixed precision.
    • 8 x A100 is around 70-80% faster than 8 x V100, when training a conv net on TensorFlow, with mixed precision.
  • Our synthetic benchmark (a sketch of the setup follows this list):

    • 1 x A100 is only 30% faster than 1 x V100, when training a conv net on PyTorch, with mixed precision.
  • A client training task:

    • 4 x A100 is only 5% faster than 4 x V100, when training a conv net on PyTorch, with mixed precision.
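
For reference, below is a minimal sketch of the kind of synthetic benchmark we are running (not our exact script; ResNet-50, batch size 64, and the random data are placeholders): a conv net trained with torch.cuda.amp mixed precision on synthetic tensors, so I/O and preprocessing are excluded from the measurement.

```python
# Minimal synthetic mixed-precision training benchmark (sketch, not our exact script).
# Assumes torchvision is installed; ResNet-50 and batch size 64 are placeholders.
import time
import torch
import torchvision

device = torch.device("cuda")
model = torchvision.models.resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

# Random data keeps I/O and preprocessing out of the measurement.
images = torch.randn(64, 3, 224, 224, device=device)
labels = torch.randint(0, 1000, (64,), device=device)

def train_step():
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Warm-up iterations before timing (CUDA context, cuDNN autotuning, etc.).
for _ in range(10):
    train_step()

torch.cuda.synchronize()
start = time.time()
iters = 100
for _ in range(iters):
    train_step()
torch.cuda.synchronize()
print(f"{64 * iters / (time.time() - start):.1f} images/sec")
```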

Question:
What could be the reason for us getting only 30% and 5% increases, instead of ~60% in both cases?

Thank you very very much for your help!

There could be a very large number of reasons for training performance to suffer. Here’s a checklist of things I would check out or confirm:

  1. Is the vision model something off the shelf with a standard backbone, or something built from scratch? If it's from scratch, check whether a particular layer is bottlenecking the later layers and whether Tensor Cores are actually being used (matrix dimensions need to be divisible by 8); see the first sketch after this list.
  2. Is the software stack the same as, or similar to, what the benchmarks use? The NVIDIA PyTorch container on NGC is built to get the best performance out of NVIDIA GPUs.
  3. Profiling: use something like PyProf or DLProf to get a GPU-centric profile of what is happening and when, to see whether the hardware is being used effectively or whether there is a bottleneck, e.g. I/O or data preprocessing leaving the GPU stalled for stretches of time (see the profiler sketch after this list).
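
On point 1, a quick, illustrative way to flag layers whose shapes can keep AMP off the Tensor Cores is to walk the model and report any Linear or Conv2d whose feature/channel counts are not multiples of 8. This helper is just a sketch (the function name and the ResNet-50 example are placeholders, not an official tool):

```python
# Heuristic check for Tensor Core friendliness under FP16/AMP: GEMM and conv
# dimensions (in/out features, channel counts) should be multiples of 8 for
# Tensor Core kernels to be selected. Illustrative helper, not an official tool.
import torch.nn as nn
import torchvision

def report_tensor_core_unfriendly_layers(model: nn.Module, multiple: int = 8):
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            dims = (module.in_features, module.out_features)
        elif isinstance(module, nn.Conv2d):
            dims = (module.in_channels, module.out_channels)
        else:
            continue
        if any(d % multiple != 0 for d in dims):
            print(f"{name}: {type(module).__name__} dims {dims} "
                  f"not all multiples of {multiple}")

model = torchvision.models.resnet50()  # placeholder; use your own model here
report_tensor_core_unfriendly_layers(model)
```

For an off-the-shelf ResNet-50 only the stem convolution (3 input channels) should get flagged, which is expected; flags on large intermediate layers are the ones worth fixing.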
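
On point 3, if PyProf/DLProf are not an option, the profiler built into PyTorch gives a similar GPU-centric view. A minimal, self-contained sketch (the tiny model and the synthetic batch generator are placeholders for your real training step and DataLoader):

```python
# torch.profiler sketch (an alternative to PyProf/DLProf) to see whether the GPU
# is stalled on data loading or other CPU-side work. The model and the synthetic
# batch generator are placeholders for a real training step and DataLoader.
import torch
import torch.nn as nn
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

device = torch.device("cuda")
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Flatten(),
                      nn.Linear(64 * 32 * 32, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

# Placeholder input pipeline: swap in your real DataLoader to see its cost.
def batches(n):
    for _ in range(n):
        yield torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,))

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=5),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
) as prof:
    for images, labels in batches(8):
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        prof.step()

# Long gaps between CUDA kernels, or heavy time in data-loading/copy ops,
# indicate an input-pipeline bottleneck rather than a slow GPU.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```

The trace written to ./profiler_logs can also be opened with TensorBoard's profiler plugin to inspect the kernel timeline step by step.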