L40S - Multi GPU doesn't work

Hello,

I have an HPE server with 4x L40S GPUs and 2 AMD EPYC 9124 CPUs (16 cores each, SMT disabled). When I try to run a process on more than one GPU (using model = nn.DataParallel(model), but it doesn't work with TensorFlow either), it never completes, although it works on a single GPU.

The symptom is that via nvidia-smi, we see:

  • either 100% utilization of a single card and 0% on the others, while the power and memory usage increase on all of them
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:02:00.0 Off |                    0 |
| N/A   49C    P0             86W /  350W |     707MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    Off |   00000000:64:00.0 Off |                    0 |
| N/A   47C    P0             85W /  350W |     711MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L40S                    Off |   00000000:82:00.0 Off |                    0 |
| N/A   50C    P0             86W /  350W |     711MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L40S                    Off |   00000000:E3:00.0 Off |                    0 |
| N/A   62C    P0            116W /  350W |     595MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           49770      C   /usr/bin/python                         698MiB |
|    1   N/A  N/A           49770      C   /usr/bin/python                         702MiB |
|    2   N/A  N/A           49770      C   /usr/bin/python                         702MiB |
|    3   N/A  N/A           49770      C   /usr/bin/python                         586MiB |
+-----------------------------------------------------------------------------------------+
  • or 3 cards at 100% utilization and 1 at 0%, with power and memory usage increasing on all of them.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:02:00.0 Off |                    0 |
| N/A   52C    P0            100W /  350W |     707MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    Off |   00000000:64:00.0 Off |                    0 |
| N/A   47C    P0             85W /  350W |     711MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L40S                    Off |   00000000:82:00.0 Off |                    0 |
| N/A   52C    P0            100W /  350W |     711MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L40S                    Off |   00000000:E3:00.0 Off |                    0 |
| N/A   62C    P0            116W /  350W |     691MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           50278      C   /usr/bin/python                         698MiB |
|    1   N/A  N/A           50278      C   /usr/bin/python                         702MiB |
|    2   N/A  N/A           50278      C   /usr/bin/python                         702MiB |
|    3   N/A  N/A           50278      C   /usr/bin/python                         682MiB |
+-----------------------------------------------------------------------------------------+

When I run it through my SLURM controller and JupyterHub and limit it to two GPUs, it still doesn’t work. I’m sure it’s not just one card causing the problem, because I tried the GPUs in pairs (nvidia0 + nvidia1, nvidia2 + nvidia3) and neither pair works.

Thank you in advance for your help.
If this post is not in the right place, please let me know.
Best regards.
nvidia-bug-report.log.gz (1.1 MB)

Are you sure that the other cards are not actually used? I would sooner suspect the 0%/100% utilization figures of being wrong than the memory consumption and wattage.

Perhaps you could also post one nvidia-smi run with all cards idle, so we can compare memory and power consumption against the loaded case.

One possibility to investigate would be whether P2P is working on the platform. If it isn’t, there’s a good chance that may interfere with a workload shared across GPUs.

The p2pBandwidthLatencyTest sample code is usually a good check for this. The output of nvidia-smi topo -m would probably be useful as well. You can find plenty of forum articles discussing these topics.
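
Since the failing job already uses PyTorch, a quick first look at P2P can also be done from Python before building the CUDA sample. The sketch below is only an illustration: it assumes PyTorch is installed on the node, the helper names (peer_access_report, blocked_pairs, main) are made up for this post, and the only PyTorch calls used are torch.cuda.device_count and torch.cuda.can_device_access_peer. It reports what the driver claims about peer access; it does not prove that transfers actually complete, so p2pBandwidthLatencyTest is still worth running.

```python
from itertools import permutations


def peer_access_report(device_count, can_access_peer):
    """Map every ordered pair of distinct device indices to True/False."""
    return {
        (src, dst): bool(can_access_peer(src, dst))
        for src, dst in permutations(range(device_count), 2)
    }


def blocked_pairs(report):
    """Sorted list of (src, dst) pairs for which peer access is not reported."""
    return sorted(pair for pair, ok in report.items() if not ok)


def main():
    try:
        import torch  # assumption: PyTorch is available on the node
    except ImportError:
        print("PyTorch not installed, skipping the peer access check")
        return
    count = torch.cuda.device_count()
    report = peer_access_report(count, torch.cuda.can_device_access_peer)
    for (src, dst), ok in sorted(report.items()):
        print(f"GPU{src} -> GPU{dst}: {'peer access' if ok else 'NO peer access'}")
    print("pairs without peer access:", blocked_pairs(report))


if __name__ == "__main__":
    main()
```

If some pairs come back without peer access, that narrows things down to the platform (PCIe/IOMMU/ACS settings and the like) rather than the framework.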

Here is the output with no process launched:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:02:00.0 Off |                    0 |
| N/A   35C    P8             33W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    Off |   00000000:64:00.0 Off |                    0 |
| N/A   33C    P8             33W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L40S                    Off |   00000000:82:00.0 Off |                    0 |
| N/A   35C    P8             32W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L40S                    Off |   00000000:E3:00.0 Off |                    0 |
| N/A   37C    P8             34W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
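
To put numbers on the idle-versus-loaded comparison that was asked for, here is a rough Python sketch that pulls the power, memory, and utilization columns out of two nvidia-smi snapshots and prints the change per GPU. The rows are copied from the idle output here and from the first loaded output in the original post; the regular expression and the function names (parse_rows, deltas) are my own and simply assume the table layout printed by this driver version.

```python
import re

# Matches the "86W / 350W | 707MiB / 46068MiB | 0%" part of each GPU row.
ROW = re.compile(r"(\d+)W\s*/\s*\d+W\s*\|\s*(\d+)MiB\s*/\s*\d+MiB\s*\|\s*(\d+)%")


def parse_rows(text):
    """Return one (power_w, memory_mib, util_pct) tuple per GPU row found."""
    return [tuple(int(g) for g in m.groups()) for m in ROW.finditer(text)]


def deltas(idle, loaded):
    """Per-GPU (power increase in W, memory increase in MiB)."""
    return [(l[0] - i[0], l[1] - i[1]) for i, l in zip(idle, loaded)]


IDLE = """
| N/A   35C    P8             33W /  350W |       1MiB /  46068MiB |      0%      Default |
| N/A   33C    P8             33W /  350W |       1MiB /  46068MiB |      0%      Default |
| N/A   35C    P8             32W /  350W |       1MiB /  46068MiB |      0%      Default |
| N/A   37C    P8             34W /  350W |       1MiB /  46068MiB |      0%      Default |
"""

LOADED = """
| N/A   49C    P0             86W /  350W |     707MiB /  46068MiB |      0%      Default |
| N/A   47C    P0             85W /  350W |     711MiB /  46068MiB |      0%      Default |
| N/A   50C    P0             86W /  350W |     711MiB /  46068MiB |      0%      Default |
| N/A   62C    P0            116W /  350W |     595MiB /  46068MiB |    100%      Default |
"""

idle_rows = parse_rows(IDLE)
loaded_rows = parse_rows(LOADED)
for gpu, ((d_power, d_mem), row) in enumerate(zip(deltas(idle_rows, loaded_rows), loaded_rows)):
    print(f"GPU{gpu}: +{d_power} W, +{d_mem} MiB, utilization {row[2]}%")
```

With these rows, GPUs 0 to 2 each gain 52 to 54 W and roughly 706 to 710 MiB, and GPU 3 gains 82 W and 594 MiB. So every card leaves its idle state and holds memory, which is consistent with the process having initialised all four devices; by itself it does not show that kernels are actually running on the cards that report 0%.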

nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    SYS     SYS     0-15            0               N/A
GPU1    NODE     X      SYS     SYS     0-15            0               N/A
GPU2    SYS     SYS      X      NODE    16-31           1               N/A
GPU3    SYS     SYS     NODE     X      16-31           1               N/A
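
For anyone reading the matrix above: in the nvidia-smi legend, NODE means the path crosses PCIe plus the interconnect between PCIe host bridges inside one NUMA node, and SYS means it crosses PCIe plus the interconnect between NUMA nodes (i.e. between the two CPU sockets here). The small Python sketch below groups the GPU pairs by link type so the layout is easier to see. The parser (parse_topo, pairs_by_link) is just something written for this post, not an NVIDIA tool, and it only reads the GPU columns.

```python
from itertools import combinations

TOPO = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    SYS     SYS     0-15    0               N/A
GPU1    NODE     X      SYS     SYS     0-15    0               N/A
GPU2    SYS     SYS      X      NODE    16-31   1               N/A
GPU3    SYS     SYS     NODE     X      16-31   1               N/A
"""


def parse_topo(text):
    """Return (gpu_names, links) where links[a][b] is the link type a -> b."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header tokens such as "GPU0"; the bare "GPU" in "GPU NUMA ID" is skipped.
    gpus = [t for t in lines[0].split() if t.startswith("GPU") and t[3:].isdigit()]
    links = {}
    for ln in lines[1:]:
        toks = ln.split()
        if toks[0] in gpus:
            links[toks[0]] = dict(zip(gpus, toks[1:1 + len(gpus)]))
    return gpus, links


def pairs_by_link(gpus, links):
    """Group unordered GPU pairs by the link type reported between them."""
    grouped = {}
    for a, b in combinations(gpus, 2):
        grouped.setdefault(links[a][b], []).append((a, b))
    return grouped


gpus, links = parse_topo(TOPO)
for link, pairs in sorted(pairs_by_link(gpus, links).items()):
    print(link, pairs)
```

On this machine that gives two NODE pairs (GPU0/GPU1 and GPU2/GPU3) and four SYS pairs, so the two pairs I tested each sit on a single socket, and they still hang.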

Thanks