Training 3x Slower with nvcr.io/nvidia/pytorch:23.08-py3 Compared to 22.04-py3

I am training an AI model using PyTorch with NVIDIA NGC base images. I had been using the nvcr.io/nvidia/pytorch:22.04-py3 image successfully, but after upgrading to nvcr.io/nvidia/pytorch:23.08-py3, training is roughly 3x slower. The slowdown reproduces on two different machines with different GPUs, and both show the same behavior.

Machine 1 Setup:

  • 2x NVIDIA RTX 4090 GPUs
  • CUDA version: 12.2
  • Driver version: 535.183.01

Machine 2 Setup:

  • 2x NVIDIA RTX 3090 GPUs
  • CUDA version: 12.2
  • Driver version: 535.183.01

What I’ve Observed:

  • In both cases, checking nvidia-smi shows that the GPUs are active and under load, with GPU utilization around 50-80%.
  • Despite the GPUs appearing to be running normally, training speed in 23.08 is significantly slower than in 22.04 (a rough benchmark I use to quantify this is sketched below).
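
To put a number on the gap beyond nvidia-smi readings, this is the kind of minimal synthetic training loop I can run identically in both containers. It is only a sketch: the layer sizes, batch size, and iteration counts are arbitrary stand-ins, not my real model.

```python
# bench.py - rough throughput check; run the same script in 22.04 and 23.08
import time
import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Sequential(
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(256, 4096, device=device)
y = torch.randint(0, 1000, (256,), device=device)

# warm-up so cuDNN/cuBLAS autotuning and lazy init don't skew the timing
for _ in range(10):
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()

torch.cuda.synchronize()
start = time.perf_counter()
iters = 100
for _ in range(iters):
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{iters / elapsed:.1f} iterations/s")
```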

What I’ve Tried:

  1. Switching back to 22.04: Training speeds return to normal, indicating the issue is tied to 23.08.
  2. Verifying GPU usage: nvidia-smi shows the GPUs under load in both versions, yet the speed difference remains.
  3. Checking CUDA versions: nvidia-smi reports CUDA 12.2 on both machines (the driver's CUDA version), and the driver version is identical whichever image is used. A version dump I run inside each container is sketched below.
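
For reference, this is the minimal environment dump I run inside each container and then diff; it only uses standard torch attributes available in both images, nothing specific to my setup.

```python
# env_check.py - run inside each container and compare the output
import torch

print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
```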

Request for Help: Has anyone experienced similar issues with the 23.08 image? Could this be related to changes in PyTorch, CUDA, or other dependencies in the newer image? Any suggestions or known issues would be greatly appreciated.
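
In case it helps with diagnosis, these are the performance-related backend flags I plan to compare between the two images, since defaults for TF32 and cuDNN autotuning can differ across PyTorch releases. This is only a sketch of what I intend to check, not a confirmed cause.

```python
# flags_check.py - compare performance-related defaults between containers
import torch

# defaults for these knobs have changed across PyTorch releases, so a
# mismatch between the two images could plausibly account for part of the gap
print("cudnn.benchmark:", torch.backends.cudnn.benchmark)
print("cudnn.allow_tf32:", torch.backends.cudnn.allow_tf32)
print("cuda.matmul.allow_tf32:", torch.backends.cuda.matmul.allow_tf32)
if hasattr(torch, "get_float32_matmul_precision"):  # not present in very old builds
    print("float32_matmul_precision:", torch.get_float32_matmul_precision())
```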