A100 vs. V100 for ML Training

Hello NVIDIA Forum!

Great to be here - I hope this post is in the right place. Happy to move if not.

We are comparing the performance of the A100 vs. the V100 and can't achieve any significant boost. Please see the numbers below:

  • LambdaLabs benchmarks (see A100 vs V100 Deep Learning Benchmarks | Lambda):

    • 4 x A100 is about 55% faster than 4 x V100, when training a conv net on PyTorch, with mixed precision.
    • 4 x A100 is about 170% faster than 4 x V100, when training a language model on PyTorch, with mixed precision.
    • 1 x A100 is about 60% faster than 1 x V100, when training a conv net on PyTorch, with mixed precision.
  • NVIDIA benchmarks (see https://developer.nvidia.com/deep-learning-performance-training-inference):

    • 1 x A100 is around 60-70% faster than 1 x V100, when training a conv net on PyTorch, with mixed precision.
    • 1 x A100 is around 100-120% faster than 1 x V100, when training a conv net on TensorFlow, with mixed precision.
    • 8 x A100 is around 70-80% faster than 8 x V100, when training a conv net on PyTorch, with mixed precision.
    • 8 x A100 is around 70-80% faster than 8 x V100, when training a conv net on TensorFlow, with mixed precision.
  • Our synthetic benchmark (a sketch of the setup follows this list):

    • 1 x A100 is only 30% faster than 1 x V100, when training a conv net on PyTorch, with mixed precision.
  • A client training task:

    • 4 x A100 is only 5% faster than 4 x V100, when training a conv net on PyTorch, with mixed precision.
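
For reference, below is a minimal sketch of the kind of synthetic benchmark we are running (not our exact script; ResNet-50, batch size 64, and the random data are placeholders): a conv net trained with torch.cuda.amp mixed precision on synthetic tensors, so I/O and preprocessing are excluded from the measurement.

```python
# Minimal synthetic mixed-precision training benchmark (sketch, not our exact script).
# Assumes torchvision is installed; ResNet-50 and batch size 64 are placeholders.
import time
import torch
import torchvision

device = torch.device("cuda")
model = torchvision.models.resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

# Random data keeps I/O and preprocessing out of the measurement.
images = torch.randn(64, 3, 224, 224, device=device)
labels = torch.randint(0, 1000, (64,), device=device)

def train_step():
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Warm-up iterations before timing (CUDA context, cuDNN autotuning, etc.).
for _ in range(10):
    train_step()

torch.cuda.synchronize()
start = time.time()
iters = 100
for _ in range(iters):
    train_step()
torch.cuda.synchronize()
print(f"{64 * iters / (time.time() - start):.1f} images/sec")
```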

Question:
What could be the reason for us getting only 30% and 5% increases, instead of ~60% in both cases?

Thank you very very much for your help!

There could be a very large number of reasons for training performance to suffer. Here’s a checklist of things I would check out or confirm:

  1. Is the vision model something off the shelf with a standard backbone, or something built from scratch? If it's from scratch, check whether a particular layer is bottlenecking the later layers and whether Tensor Cores are actually being used (matrix dimensions need to be divisible by 8); see the first sketch after this list.
  2. Is the software stack the same as, or similar to, what the benchmarks use? The NVIDIA PyTorch container on NGC is built to get the best performance out of NVIDIA GPUs.
  3. Profiling: use something like PyProf or DLProf to get a GPU-centric profile of what is happening and when, to see whether the hardware is being used effectively or whether there is a bottleneck, e.g. I/O or data preprocessing leaving the GPU stalled for stretches of time (see the profiler sketch after this list).
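
On point 1, a quick, illustrative way to flag layers whose shapes can keep AMP off the Tensor Cores is to walk the model and report any Linear or Conv2d whose feature/channel counts are not multiples of 8. This helper is just a sketch (the function name and the ResNet-50 example are placeholders, not an official tool):

```python
# Heuristic check for Tensor Core friendliness under FP16/AMP: GEMM and conv
# dimensions (in/out features, channel counts) should be multiples of 8 for
# Tensor Core kernels to be selected. Illustrative helper, not an official tool.
import torch.nn as nn
import torchvision

def report_tensor_core_unfriendly_layers(model: nn.Module, multiple: int = 8):
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            dims = (module.in_features, module.out_features)
        elif isinstance(module, nn.Conv2d):
            dims = (module.in_channels, module.out_channels)
        else:
            continue
        if any(d % multiple != 0 for d in dims):
            print(f"{name}: {type(module).__name__} dims {dims} "
                  f"not all multiples of {multiple}")

model = torchvision.models.resnet50()  # placeholder; use your own model here
report_tensor_core_unfriendly_layers(model)
```

For an off-the-shelf ResNet-50 only the stem convolution (3 input channels) should get flagged, which is expected; flags on large intermediate layers are the ones worth fixing.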
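
On point 3, if PyProf/DLProf are not an option, the profiler built into PyTorch gives a similar GPU-centric view. A minimal, self-contained sketch (the tiny model and the synthetic batch generator are placeholders for your real training step and DataLoader):

```python
# torch.profiler sketch (an alternative to PyProf/DLProf) to see whether the GPU
# is stalled on data loading or other CPU-side work. The model and the synthetic
# batch generator are placeholders for a real training step and DataLoader.
import torch
import torch.nn as nn
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

device = torch.device("cuda")
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Flatten(),
                      nn.Linear(64 * 32 * 32, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

# Placeholder input pipeline: swap in your real DataLoader to see its cost.
def batches(n):
    for _ in range(n):
        yield torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,))

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=5),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
) as prof:
    for images, labels in batches(8):
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        prof.step()

# Long gaps between CUDA kernels, or heavy time in data-loading/copy ops,
# indicate an input-pipeline bottleneck rather than a slow GPU.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```

The trace written to ./profiler_logs can also be opened with TensorBoard's profiler plugin to inspect the kernel timeline step by step.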