Training 3x Slower with nvcr.io/nvidia/pytorch:23.08-py3 Compared to 22.04-py3

I am training an AI model using PyTorch with NVIDIA NGC base images. I had been using the nvcr.io/nvidia/pytorch:22.04-py3 image successfully, but after upgrading to nvcr.io/nvidia/pytorch:23.08-py3, training is roughly 3x slower. The slowdown reproduces on two different machines with different GPUs, and both show the same behavior.

Machine 1 Setup:

  • 2x NVIDIA RTX 4090 GPUs
  • CUDA version: 12.2
  • Driver version: 535.183.01

Machine 2 Setup:

  • 2x NVIDIA RTX 3090 GPUs
  • CUDA version: 12.2
  • Driver version: 535.183.01

What I’ve Observed:

  • In both cases, checking nvidia-smi shows that the GPUs are active and under load, with GPU utilization around 50-80%.
  • Despite the GPUs appearing to be running normally, training speed in 23.08 is significantly slower than in 22.04 (a rough benchmark I use to quantify this is sketched below).
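
To put a number on the gap beyond nvidia-smi readings, this is the kind of minimal synthetic training loop I can run identically in both containers. It is only a sketch: the layer sizes, batch size, and iteration counts are arbitrary stand-ins, not my real model.

```python
# bench.py - rough throughput check; run the same script in 22.04 and 23.08
import time
import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Sequential(
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(256, 4096, device=device)
y = torch.randint(0, 1000, (256,), device=device)

# warm-up so cuDNN/cuBLAS autotuning and lazy init don't skew the timing
for _ in range(10):
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()

torch.cuda.synchronize()
start = time.perf_counter()
iters = 100
for _ in range(iters):
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{iters / elapsed:.1f} iterations/s")
```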

What I’ve Tried:

  1. Switching back to 22.04: Training speeds return to normal, indicating the issue is tied to 23.08.
  2. Verifying GPU usage: nvidia-smi shows the GPUs under load in both versions, yet the speed difference remains.
  3. Checking CUDA versions: nvidia-smi reports CUDA 12.2 on both machines (the driver's CUDA version), and the driver version is identical whichever image is used. A version dump I run inside each container is sketched below.
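
For reference, this is the minimal environment dump I run inside each container and then diff; it only uses standard torch attributes available in both images, nothing specific to my setup.

```python
# env_check.py - run inside each container and compare the output
import torch

print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
```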

Request for Help: Has anyone experienced similar issues with the 23.08 image? Could this be related to changes in PyTorch, CUDA, or other dependencies in the newer image? Any suggestions or known issues would be greatly appreciated.
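
In case it helps with diagnosis, these are the performance-related backend flags I plan to compare between the two images, since defaults for TF32 and cuDNN autotuning can differ across PyTorch releases. This is only a sketch of what I intend to check, not a confirmed cause.

```python
# flags_check.py - compare performance-related defaults between containers
import torch

# defaults for these knobs have changed across PyTorch releases, so a
# mismatch between the two images could plausibly account for part of the gap
print("cudnn.benchmark:", torch.backends.cudnn.benchmark)
print("cudnn.allow_tf32:", torch.backends.cudnn.allow_tf32)
print("cuda.matmul.allow_tf32:", torch.backends.cuda.matmul.allow_tf32)
if hasattr(torch, "get_float32_matmul_precision"):  # not present in very old builds
    print("float32_matmul_precision:", torch.get_float32_matmul_precision())
```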