Hello,
I have an HPE server with 4x L40S GPUs and two AMD EPYC 9124 CPUs (16 cores each, SMT disabled). When I try to run a process on more than one GPU (using model = nn.DataParallel(model) in PyTorch; it doesn't work with TensorFlow either), it never completes, although the same job works fine on a single GPU.
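For reference, this is a minimal sketch of the kind of multi-GPU call that hangs; the model and tensor below are placeholders, not my real workload:

import torch
import torch.nn as nn

# Placeholder model and batch, just to show the shape of the call that hangs.
model = nn.Linear(1024, 1024)
model = nn.DataParallel(model).cuda()   # replicate across all visible GPUs

x = torch.randn(64, 1024).cuda()        # the batch is split across the GPUs
y = model(x)                            # on more than one GPU this forward pass never returns
print(y.shape)                          # on a single GPU this prints torch.Size([64, 1024])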
The symptom, as seen in nvidia-smi, is one of the following:
- either a single card at 100% utilization and 0% on the others, while power draw and memory usage increase on all of them:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15 Driver Version: 570.86.15 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S Off | 00000000:02:00.0 Off | 0 |
| N/A 49C P0 86W / 350W | 707MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L40S Off | 00000000:64:00.0 Off | 0 |
| N/A 47C P0 85W / 350W | 711MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L40S Off | 00000000:82:00.0 Off | 0 |
| N/A 50C P0 86W / 350W | 711MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L40S Off | 00000000:E3:00.0 Off | 0 |
| N/A 62C P0 116W / 350W | 595MiB / 46068MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 49770 C /usr/bin/python 698MiB |
| 1 N/A N/A 49770 C /usr/bin/python 702MiB |
| 2 N/A N/A 49770 C /usr/bin/python 702MiB |
| 3 N/A N/A 49770 C /usr/bin/python 586MiB |
+-----------------------------------------------------------------------------------------+
- or three cards at 100% utilization and one at 0%, with power draw and memory usage increasing on all of them:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15 Driver Version: 570.86.15 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S Off | 00000000:02:00.0 Off | 0 |
| N/A 52C P0 100W / 350W | 707MiB / 46068MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L40S Off | 00000000:64:00.0 Off | 0 |
| N/A 47C P0 85W / 350W | 711MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L40S Off | 00000000:82:00.0 Off | 0 |
| N/A 52C P0 100W / 350W | 711MiB / 46068MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L40S Off | 00000000:E3:00.0 Off | 0 |
| N/A 62C P0 116W / 350W | 691MiB / 46068MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 50278 C /usr/bin/python 698MiB |
| 1 N/A N/A 50278 C /usr/bin/python 702MiB |
| 2 N/A N/A 50278 C /usr/bin/python 702MiB |
| 3 N/A N/A 50278 C /usr/bin/python 682MiB |
+-----------------------------------------------------------------------------------------+
When I run it through my SLURM controller and JupyterHub and limit the job to two GPUs, it still doesn't work. I'm confident it isn't a single faulty card, because I've tried both pairs and neither works (nvidia0 + nvidia1, nvidia2 + nvidia3); the test is sketched below.
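For completeness, this is roughly how I pin a pair of cards when testing outside the scheduler (the environment variable has to be set before CUDA is initialized; the two values are simply the two pairs I tried):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"   # also tried "2,3"; both pairs hang the same way

import torch
print(torch.cuda.device_count())             # reports 2, as expected

# Running the same DataParallel snippet as above then hangs on either pair,
# which is why I don't believe a single card is at fault.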
Thank you in advance for your help.
If this post is not in the right place, please let me know.
Best regards.
nvidia-bug-report.log.gz (1.1 MB)