Hello all,
I have a server with six NVIDIA GPUs:
- Four RTX 4070 SUPER and two GTX 1080 Ti.
- The server runs Ubuntu 22.04.
- Driver 550.127.08, CUDA 12.4.
- All cards are detected properly and list fine in nvidia-smi.
- All cards are in Exclusive Process (E. Process) compute mode (see the check below).
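For completeness, this is roughly how I confirm the compute mode from Python. It is only a minimal sketch built on the pynvml package (assumed installed); the loop simply prints whether each detected card reports Exclusive Process mode:

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mode = pynvml.nvmlDeviceGetComputeMode(handle)
    # True when the card is in Exclusive Process compute mode
    print(i, pynvml.nvmlDeviceGetName(handle),
          mode == pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS)
pynvml.nvmlShutdown()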
I use Python to launch one CUDA process per card. The program automatically dispatches processes to idle cards in round-robin order. This works flawlessly on other servers with 8 and 10 cards, but those servers all use a single card model (2080 Ti).
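To make that concrete, here is a stripped-down sketch of the dispatcher logic, not the production code: worker.py, the job arguments and the busy dict are placeholders. Each job is pinned to one card via CUDA_VISIBLE_DEVICES and started as its own process:

import itertools
import os
import subprocess

NUM_GPUS = 6
gpu_cycle = itertools.cycle(range(NUM_GPUS))  # round-robin over card indices

def launch_next(job_args, busy):
    # busy maps gpu index -> running subprocess; an entry is removed when its job returns
    for _ in range(NUM_GPUS):
        gpu = next(gpu_cycle)
        if gpu in busy:
            continue
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        # worker.py stands in for the real per-card CUDA job
        busy[gpu] = subprocess.Popen(["python", "worker.py", *job_args], env=env)
        return gpu
    return None  # every card is busy right now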
So on this particular “hybrid” server, after the first round of executions, one random 4070 becomes impossible to reuse: Python gets an exception saying “The requested CUDA device could not be loaded”. All the other cards keep working, including the rest of the 4070s.
It is also not always the same card that blocks. Sometimes it is card 1, other times card 4, just whichever card was released first. It seems as if something, somewhere, only allows 5 cards simultaneously, even though I can use all 6 cards properly on the first run.
The Python function that uses the card does return and release the resource, and nvidia-smi doesn’t report any process running on the card either.
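For reference, this is the kind of check I use to confirm a card looks free after a job returns. Again just a sketch on top of pynvml; the 64 MiB threshold is an arbitrary cutoff I picked for illustration:

import pynvml

def card_looks_free(index):
    # True when NVML reports no compute processes and (almost) no memory in use on that card
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return len(procs) == 0 and mem.used < 64 * 1024 * 1024
    finally:
        pynvml.nvmlShutdown()

On the blocked card this reports free (matching the 4 MiB, no-process state shown for card 4 below), yet the next launch on it still fails with the same exception.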
I don’t know where to go from here.
Thanks in advance for any help.
Here is an example of my nvidia-smi output, with one of the cards (id 4) sitting idle and unusable:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08              Driver Version: 550.127.08      CUDA Version: 12.4   |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    On  |   00000000:01:00.0 Off |                  N/A |
| 42%   60C    P2            140W /  220W |     259MiB /  12282MiB |     91%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080 Ti     On  |   00000000:1B:00.0 Off |                  N/A |
| 46%   78C    P2            217W /  250W |     181MiB /  11264MiB |     90%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4070 ...    On  |   00000000:3E:00.0 Off |                  N/A |
| 40%   59C    P2            147W /  220W |     259MiB /  12282MiB |     92%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce GTX 1080 Ti     On  |   00000000:88:00.0 Off |                  N/A |
| 42%   72C    P2            202W /  250W |     181MiB /  11264MiB |     91%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 4070 ...    On  |   00000000:B1:00.0 Off |                  N/A |
|  0%   31C    P8             11W /  220W |       4MiB /  12282MiB |      0%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 4070 ...    On  |   00000000:DA:00.0 Off |                  N/A |
| 30%   56C    P2            148W /  220W |     261MiB /  12282MiB |     92%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                               |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    388737      C   python                                        252MiB |
|    1   N/A  N/A    387742      C   python                                        176MiB |
|    2   N/A  N/A    382237      C   python                                        252MiB |
|    3   N/A  N/A    381846      C   python                                        176MiB |
|    5   N/A  N/A    381796      C   python                                        254MiB |
+-----------------------------------------------------------------------------------------+