Performance Slowdown during Distributed Training with 4x RTX 4090 GPUs

Hey, I am currently experiencing a considerable training slowdown while implementing Distributed Training with four RTX 4090 GPUs on various Computer Vision models, such as YOLO and ResNet50. After initiating the training process, I observed a significant drop in power usage from 450W to around 80-90W within just a few seconds, resulting in the training becoming approximately 6-8 times slower.

My setup is liquid-cooled and I’ve verified that the temperature levels for both CPU and GPU are within the acceptable range, so overheating can be ruled out. To further diagnose this issue, I ran training with large language models (LLMs) and performed a GPU-Burn test. The GPUs operated normally, sustaining 450W power usage without any significant increase in temperature. This peculiar slowdown appears to be exclusive to training Computer Vision models, particularly when the data is distributed across multiple GPUs.

Moreover, I have run individual tests for CPU, GPU, and I/O, and everything works fine stand-alone. However, as soon as I start training a CV model, the problem occurs. I tried different versions of YOLO as well as ResNet50 and noticed similar behavior for all of them.

System Specs:

CPU: AMD RYZEN Threadripper Pro 5965WX 24-Core 3.80 GHz (Threadripper PRO 2023)
GPU: 4 x Liquid-cooled NVIDIA RTX 4090 24 GB
OS: Ubuntu 22.04
Nvidia-driver: 525.105.17

I’ve done some research and found that numerous people seem to be encountering similar issues and pointing out that peer-to-peer (P2P) access is disabled on NVIDIA 40-series cards:

  1. Standard nVidia CUDA tests fail with dual RTX 4090 Linux box - #53 by anto_4090
  2. Parallel training with 4 cards 4090 cannot be performed on AMD 5975WX, stuck at the beginning - #10 by jaybob20
  3. DDP training on RTX 4090 (ADA, cu118) - #6 by Tim_Hanson - distributed - PyTorch Forums
  4. Problems With RTX4090 MultiGPU and AMD vs Intel vs RTX6000Ada or RTX3090 | Puget Systems

I’m wondering whether the lack of peer-to-peer (P2P) access could be contributing to this problem. If so, could an NVIDIA driver update fix it, or is this a permanent limitation that will hinder the use of a quad 4090 setup for Deep Learning in the future?
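
One way to see what the driver currently reports for P2P is to query every GPU pair from PyTorch. The following is a minimal sketch, not a definitive diagnostic: it assumes PyTorch is installed and uses `torch.cuda.can_device_access_peer`; the helper names `p2p_matrix` and `format_matrix` are made up for this example. Keep in mind that a driver can misreport P2P support, so a "Y" here is not proof that P2P transfers actually work.

```python
from typing import Callable, List


def p2p_matrix(num_devices: int, can_access: Callable[[int, int], bool]) -> List[List[bool]]:
    """Return matrix[i][j] = True if device i reports peer access to device j."""
    return [
        [i != j and bool(can_access(i, j)) for j in range(num_devices)]
        for i in range(num_devices)
    ]


def format_matrix(matrix: List[List[bool]]) -> str:
    """Render the matrix as one text row per GPU (Y = peer access reported, N = not)."""
    lines = []
    for i, row in enumerate(matrix):
        cells = " ".join("-" if i == j else ("Y" if ok else "N") for j, ok in enumerate(row))
        lines.append(f"GPU{i}: {cells}")
    return "\n".join(lines)


if __name__ == "__main__":
    try:
        import torch  # assumption: PyTorch with CUDA support is installed
    except ImportError:
        print("PyTorch is not installed; nothing to query.")
    else:
        n = torch.cuda.device_count()
        print(format_matrix(p2p_matrix(n, torch.cuda.can_device_access_peer)))
```

`nvidia-smi topo -m` shows the PCIe topology between the cards. If multi-GPU runs hang rather than just run slowly, setting the NCCL environment variable `NCCL_P2P_DISABLE=1` forces NCCL off the P2P path and may be worth trying.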

I would be grateful for confirmation on whether RTX 4090 GPUs with Distributed Data-Parallel (DDP) are suitable for Deep Learning or if this is more likely to be a software-related issue. I’ve tried using the NVIDIA NGC Docker container with PyTorch2 and CUDA 11.8, as well as the TensorFlow Docker for the ResNet50 benchmark, but neither provided a solution.

Your assistance in this matter is greatly appreciated.

Make sure you have the latest drivers; the 535 drivers were just released in the Ubuntu packages. These drivers fix an issue with P2P: there is no P2P on 4090s, but the old drivers reported that there was.
When training the same model with PyTorch Lightning running from Docker, the dual setup nearly doubles my throughput:
2.03it/s dual 4090s
2.25it/s single 4090s
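
To make the comparison explicit, the two figures above can be turned into a speedup and a scaling efficiency. This is a rough sketch under one assumption that the post does not state: the per-GPU batch size is the same in both runs, so each data-parallel iteration processes one batch per GPU. The function name `ddp_scaling` is made up for this example.

```python
def ddp_scaling(single_it_per_s: float, multi_it_per_s: float, num_gpus: int):
    """Compare batches processed per second, assuming equal per-GPU batch size."""
    single_throughput = single_it_per_s            # one batch per iteration
    multi_throughput = multi_it_per_s * num_gpus   # one batch per GPU per iteration
    speedup = multi_throughput / single_throughput
    efficiency = speedup / num_gpus                # 1.0 would be perfect linear scaling
    return speedup, efficiency


speedup, efficiency = ddp_scaling(single_it_per_s=2.25, multi_it_per_s=2.03, num_gpus=2)
print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
# prints "speedup: 1.80x, scaling efficiency: 90%"
```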

If you post your Docker benchmark command, I can test it.

Hi @alaapdhall79 ,
I am checking on this. Please allow me some time.
Thank you for your patience.


Hi @AakankshaS
Any update on this matter?
@alaapdhall79 were you able to solve your issue?

Yes, it was thermal throttling that caused the slowdown, and it had nothing to do with the drivers.
P2P is indeed disabled on 30/40-series cards, but that won’t cause a major slowdown. P2P is only available on industry-level cards like the A6000, A100, etc., afaik.
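
For anyone who wants to check for the same thing, here is a minimal sketch that asks `nvidia-smi` for power draw, temperature, SM clock, and the thermal slowdown flags while a training job is running. It assumes `nvidia-smi` is on the PATH and that the driver accepts the `clocks_throttle_reasons.*` query fields (check `nvidia-smi --help-query-gpu` first, since newer drivers may list them under a different name). The helper names are made up for this example.

```python
import csv
import io
import subprocess

# Query fields; the two throttle flags report "Active" or "Not Active".
FIELDS = [
    "index",
    "power.draw",
    "temperature.gpu",
    "clocks.sm",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_thermal_slowdown",
]


def parse_rows(csv_text: str):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(rec) != len(FIELDS):
            continue  # skip blank or unexpected lines
        rows.append(dict(zip(FIELDS, (value.strip() for value in rec))))
    return rows


def thermally_throttled(row: dict) -> bool:
    """True if either thermal slowdown flag is reported as active."""
    return any(row[key] == "Active" for key in FIELDS[4:])


def query_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_rows(result.stdout)


if __name__ == "__main__":
    try:
        for row in query_gpus():
            state = "THERMAL THROTTLE" if thermally_throttled(row) else "ok"
            print(f"GPU{row['index']}: {row['power.draw']} W, "
                  f"{row['temperature.gpu']} C, {row['clocks.sm']} MHz, {state}")
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"Could not query nvidia-smi: {exc}")
```

Running this every few seconds during training should show whether a drop in power draw coincides with a thermal slowdown flag rather than with anything P2P-related.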

Thanks, that was helpful. Actually, I’m looking for P2P solutions, and as far as I’ve checked (and as you also confirmed), the 4090 doesn’t have it. I’m looking at the A6000 now as the least expensive option.