Hey, I'm seeing a severe training slowdown when running distributed training on four RTX 4090 GPUs with various computer vision models, such as YOLO and ResNet50. Within a few seconds of starting training, power usage drops from 450W to around 80-90W, and training becomes approximately 6-8 times slower.
My setup is liquid-cooled, and I’ve verified that CPU and GPU temperatures stay within the acceptable range, so overheating can be ruled out. To diagnose this further, I trained large language models (LLMs) and ran a GPU-Burn test. In both cases the GPUs operated normally, sustaining 450W power usage without any significant increase in temperature. The slowdown appears to be exclusive to training computer vision models, particularly when the data is distributed across multiple GPUs.
I have also run individual tests for CPU, GPU, and I/O, and everything works fine stand-alone. However, as soon as I start training a CV model, the problem occurs. I tried different versions of YOLO as well as ResNet50 and saw similar behavior in all of them.
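To capture the power drop described above as numbers rather than impressions, something like the following can log each GPU once per second while training runs. This is only a minimal sketch: it assumes `nvidia-smi` is on the PATH, and the helper names and the 150W threshold are made up for illustration.

```python
import csv
import io
import subprocess
import time

# Standard nvidia-smi --query-gpu fields; units are dropped by "nounits".
QUERY_FIELDS = ["index", "power.draw", "utilization.gpu", "clocks.sm", "temperature.gpu"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(QUERY_FIELDS, (value.strip() for value in record))))
    return rows


def low_power_gpus(rows, watts_threshold=150.0):
    """Return the indices of GPUs drawing less than watts_threshold (an arbitrary cutoff)."""
    flagged = []
    for row in rows:
        try:
            power = float(row["power.draw"])
        except ValueError:
            continue  # the driver may report a non-numeric value such as "[N/A]"
        if power < watts_threshold:
            flagged.append(int(row["index"]))
    return flagged


def sample():
    """Query all GPUs once via nvidia-smi."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


def monitor(samples=60, interval=1.0):
    """Print one line per sample, listing GPUs currently below the power threshold."""
    for _ in range(samples):
        rows = sample()
        print(time.strftime("%H:%M:%S"), rows, "low power:", low_power_gpus(rows))
        time.sleep(interval)
```

Calling `monitor()` in a second terminal right before starting the training job should show at which second the power drops, and whether utilization and SM clocks fall with it on all four GPUs or only on some.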
System Specs:
CPU: AMD Ryzen Threadripper PRO 5965WX 24-Core 3.80 GHz (Threadripper PRO 2023)
GPU: 4 x Liquid-cooled NVIDIA RTX 4090 24 GB
Storage: NVMe SSD
OS: Ubuntu 22.04
NVIDIA driver: 525.105.17
I’ve done some research and found that numerous people seem to be encountering similar issues, with several pointing out that P2P is disabled on NVIDIA 40-series cards:
- Standard nVidia CUDA tests fail with dual RTX 4090 Linux box - #53 by anto_4090
- Parallel training with 4 cards 4090 cannot be performed on AMD 5975WX, stuck at the beginning - #10 by jaybob20
- DDP training on RTX 4090 (ADA, cu118) - #6 by Tim_Hanson - distributed - PyTorch Forums
- Problems With RTX4090 MultiGPU and AMD vs Intel vs RTX6000Ada or RTX3090 | Puget Systems
I’m wondering whether the peer-to-peer (P2P) limitation could be contributing to this problem. If so, could an NVIDIA driver update fix it, or is this a permanent limitation that hinders the usefulness of a quad 4090 setup for deep learning applications in the future?
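One way to see what the driver reports about P2P on a given machine is to query every GPU pair. The sketch below assumes PyTorch with CUDA support is installed; `torch.cuda.can_device_access_peer` is a real PyTorch call, while the helper names are hypothetical. Note that it only shows what the driver claims, not whether peer transfers actually work or how fast they are.

```python
from itertools import permutations


def p2p_matrix(device_count, can_access):
    """Map each ordered GPU pair (src, dst) to whether src reports P2P access to dst."""
    return {
        (src, dst): bool(can_access(src, dst))
        for src, dst in permutations(range(device_count), 2)
    }


def blocked_pairs(matrix):
    """Return the sorted list of pairs for which P2P access is reported as unavailable."""
    return sorted(pair for pair, ok in matrix.items() if not ok)


def check_with_torch():
    """Build the matrix from the local GPUs as seen by PyTorch."""
    import torch  # imported lazily so the helpers above work without PyTorch

    count = torch.cuda.device_count()
    matrix = p2p_matrix(count, torch.cuda.can_device_access_peer)
    for (src, dst), ok in sorted(matrix.items()):
        print(f"GPU{src} -> GPU{dst}: {'P2P reported' if ok else 'no P2P'}")
    return matrix
```

Running `check_with_torch()` prints one line per pair. `nvidia-smi topo -m` shows the PCIe topology between the cards, and the `p2pBandwidthLatencyTest` program from the CUDA samples measures actual peer transfers, which is a more direct check than the reported capability alone.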
I would be grateful for confirmation on whether RTX 4090 GPUs are suitable for deep learning with Distributed Data-Parallel (DDP) training, or whether this is more likely to be a software-related issue. I’ve tried the NVIDIA NGC Docker container with PyTorch 2 and CUDA 11.8, as well as the TensorFlow Docker image for the ResNet50 benchmark, but neither resolved the problem.
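An experiment that may help separate a P2P problem from other causes is to run the same DDP job twice, once with NCCL's peer-to-peer transport disabled, and compare throughput and the NCCL log. The sketch below is an assumption about how such a comparison could be scripted: `NCCL_P2P_DISABLE` and `NCCL_DEBUG` are documented NCCL environment variables and `torchrun --standalone --nproc_per_node` is the standard PyTorch launcher, but `train.py` and the helper names are placeholders.

```python
import os
import subprocess


def nccl_debug_env(disable_p2p, base_env=None):
    """Return a copy of the environment with NCCL logging on and P2P toggled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # makes NCCL log which transports it selects
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"  # ask NCCL not to use peer-to-peer transport
    else:
        env.pop("NCCL_P2P_DISABLE", None)
    return env


def torchrun_command(script, gpus=4):
    """Build a single-node torchrun command line for the given training script."""
    return ["torchrun", "--standalone", f"--nproc_per_node={gpus}", script]


def run_comparison(script="train.py"):
    """Run the (hypothetical) training script with P2P left on, then with it disabled."""
    for disable_p2p in (False, True):
        print("=== NCCL_P2P_DISABLE =", int(disable_p2p), "===")
        subprocess.run(torchrun_command(script), env=nccl_debug_env(disable_p2p), check=False)
```

If the run with `NCCL_P2P_DISABLE=1` holds full power draw and normal speed while the other does not, that would point toward the P2P path rather than the models or the data pipeline.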
Your assistance in this matter is greatly appreciated.