Performance Slowdown during Distributed Training with 4x RTX 4090 GPUs

Hey, I am currently experiencing a considerable training slowdown while running distributed training with four RTX 4090 GPUs on various computer vision models, such as YOLO and ResNet50. Within a few seconds of starting training, GPU power draw drops from 450 W to around 80-90 W and training becomes approximately 6-8 times slower.
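For anyone trying to reproduce this, power draw, clocks, and temperature can be logged alongside the run with something like the sketch below (assumes `nvidia-smi` is on the PATH; the query fields and the 2-second interval are just illustrative choices):

```python
# log_gpu_power.py -- periodically log power, SM clock, temperature and
# utilization for every GPU while a training run is in progress.
# Run it in a second terminal next to the training job.
import subprocess
import time

QUERY_FIELDS = "index,power.draw,clocks.sm,temperature.gpu,utilization.gpu"

def sample():
    out = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()

if __name__ == "__main__":
    while True:
        stamp = time.strftime("%H:%M:%S")
        for line in sample():
            # e.g. "12:01:05  gpu 0, 448.2, 2520, 61, 99"
            print(f"{stamp}  gpu {line}")
        time.sleep(2)
```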

My setup is liquid-cooled and I’ve verified that both CPU and GPU temperatures stay within the acceptable range, so overheating can be ruled out. To diagnose the issue further, I trained large language models (LLMs) and ran a GPU-Burn test; in both cases the GPUs operated normally, sustaining 450 W without any significant increase in temperature. This peculiar slowdown appears to be exclusive to training computer vision models, particularly when the data is distributed across multiple GPUs.

Moreover, I have run individual tests for the CPU, GPU, and I/O, and everything works fine in isolation; however, as soon as I start training a CV model the problem occurs. I tried different versions of YOLO and ResNet50 and saw the same behavior with all of them.
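To separate the GPU interconnect from the model code, a rough all-reduce benchmark across all four cards can be run with torch.distributed. This is only a sketch (tensor size and iteration count are arbitrary):

```python
# allreduce_bench.py -- rough NCCL all-reduce throughput test across all GPUs,
# independent of any CV model. Sizes and iteration counts are arbitrary.
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # ~256 MB of fp32 per GPU
    x = torch.randn(64 * 1024 * 1024, device=f"cuda:{rank}")

    # warm-up iterations
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    dt = (time.time() - t0) / iters

    if rank == 0:
        gb = x.numel() * x.element_size() / 1e9
        print(f"all_reduce of {gb:.2f} GB took {dt * 1000:.1f} ms per iteration")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```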

System Specs:

CPU: AMD Ryzen Threadripper PRO 5965WX, 24 cores, 3.80 GHz (Threadripper PRO 2023)
GPU: 4x liquid-cooled NVIDIA RTX 4090 24 GB
Storage: NVMe SSD
OS: Ubuntu 22.04
NVIDIA driver: 525.105.17

I’ve done some research and found that numerous people seem to be encountering similar issues, with several pointing out that P2P is disabled on NVIDIA 40-series cards:

  1. Standard nVidia CUDA tests fail with dual RTX 4090 Linux box - #53 by anto_4090
  2. Parallel training with 4 cards 4090 cannot be performed on AMD 5975WX, stuck at the beginning - #10 by jaybob20
  3. DDP training on RTX 4090 (ADA, cu118) - #6 by Tim_Hanson - distributed - PyTorch Forums
  4. Problems With RTX4090 MultiGPU and AMD vs Intel vs RTX6000Ada or RTX3090 | Puget Systems

I’m wondering whether the peer-to-peer (P2P) limitation could be contributing to this problem. If so, could an NVIDIA driver update rectify it, or is this a permanent limitation that undermines the usefulness of a quad-4090 setup for deep learning going forward?
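For reference, PyTorch can query peer access directly; a quick sketch that checks every pair of cards (note that, as a reply below points out, older drivers may report peer access that is not actually usable):

```python
# p2p_check.py -- report whether CUDA peer-to-peer access is available
# between each pair of GPUs, as seen by PyTorch.
import itertools

import torch

n = torch.cuda.device_count()
for a, b in itertools.permutations(range(n), 2):
    ok = torch.cuda.can_device_access_peer(a, b)
    print(f"GPU {a} -> GPU {b}: peer access {'available' if ok else 'NOT available'}")
```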

I would be grateful for confirmation on whether RTX 4090 GPUs are suitable for deep learning with Distributed Data Parallel (DDP), or whether this is more likely a software-related issue. I’ve tried the NVIDIA NGC Docker container with PyTorch 2 and CUDA 11.8, as well as the TensorFlow Docker image for the ResNet50 benchmark, but neither provided a solution.
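One way to test whether P2P is even in the communication path is to force NCCL to avoid it and compare throughput. A minimal sketch of the relevant environment variables, set before the process group is created (these are standard NCCL variables, not specific to any framework):

```python
# Set these before torch.distributed initializes NCCL, e.g. at the very top
# of the training script, then compare runs with and without P2P transport.
import os

os.environ["NCCL_DEBUG"] = "INFO"        # print which transport NCCL selects
os.environ["NCCL_P2P_DISABLE"] = "1"     # force NCCL to avoid P2P (falls back to SHM/PCIe)

# ... then the usual DDP setup, e.g.:
# import torch.distributed as dist
# dist.init_process_group("nccl", ...)
```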

Your assistance in this matter is greatly appreciated.

Make sure you have the latest drivers; the 535 drivers were just released as Ubuntu packages. These drivers fix an issue with P2P: there is no P2P on 4090s, but older drivers reported that there was.
When training the same model with PyTorch Lightning running from Docker, the dual-GPU run keeps roughly the same iteration rate while processing twice the batches per step:
2.03 it/s with dual 4090s
2.25 it/s with a single 4090

If you post your Docker benchmark command, I can test it.
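For reference, a minimal Lightning comparison along these lines (placeholder model and random data, assuming the Lightning 2.x Trainer API; switch `devices` between 1 and 2 and compare the reported it/s):

```python
# lightning_ddp_bench.py -- minimal single- vs multi-GPU throughput comparison.
# Placeholder model and random data; assumes the Lightning 2.x API.
import lightning as L
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class TinyNet(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    # Random ImageNet-shaped data, just to drive the GPUs.
    ds = TensorDataset(torch.randn(1024, 3, 224, 224),
                       torch.randint(0, 1000, (1024,)))
    loader = DataLoader(ds, batch_size=64, num_workers=4)

    # Change devices=1 / devices=2 / devices=4 and compare the reported it/s.
    trainer = L.Trainer(accelerator="gpu", devices=2, strategy="ddp",
                        max_epochs=1)
    trainer.fit(TinyNet(), loader)
```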

Hi @alaapdhall79 ,
I am checking on this; please allow me some time.
Thank you for your patience.


Hi @AakankshaS
Any update on this matter?
@alaapdhall79, were you able to solve your issue?

Yes, it was thermal throttling that caused the slowdown; it had nothing to do with the drivers.
P2P is indeed disabled on 30/40-series cards, but that alone won’t cause major slowdowns. P2P is only available on professional cards like the A6000, A100, etc., AFAIK.
Thanks.
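If anyone else runs into this, the throttle state can be read directly from NVML; a small sketch using the pynvml bindings (assuming the nvidia-ml-py package is installed; constant names may vary slightly between versions):

```python
# throttle_check.py -- print the active clock throttle reasons for each GPU.
# Assumes the nvidia-ml-py (pynvml) package; constant names can vary by version.
import pynvml

REASONS = {
    pynvml.nvmlClocksThrottleReasonSwPowerCap:        "SW power cap",
    pynvml.nvmlClocksThrottleReasonHwSlowdown:        "HW slowdown",
    pynvml.nvmlClocksThrottleReasonSwThermalSlowdown: "SW thermal slowdown",
    pynvml.nvmlClocksThrottleReasonHwThermalSlowdown: "HW thermal slowdown",
}

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
    active = [name for bit, name in REASONS.items() if mask & bit]
    print(f"GPU {i}: {', '.join(active) if active else 'no throttling'}")
pynvml.nvmlShutdown()
```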

Thanks, that was helpful. I’m actually looking for P2P support, and as far as I’ve checked (and you also verified), the 4090 doesn’t have it. I’m now looking at the A6000 as the least expensive option.