Deploying PyTorch with MPS on multi-GPU machines

Hello

I’m trying to start a PyTorch training session on a multi-GPU machine with MPS. Previously I was able to deploy MPS on a machine with a single GPU. I used the same process on a multi-GPU machine, and I’m getting output that looks like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:19:00.0 Off |                  N/A |
| 22%   54C    P2   156W / 215W |   2586MiB /  7982MiB |     87%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2080    Off  | 00000000:1A:00.0 Off |                  N/A |
| 23%   51C    P2   148W / 215W |   2500MiB /  7982MiB |     85%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 2080    Off  | 00000000:67:00.0 Off |                  N/A |
| 23%   54C    P2   179W / 215W |   2486MiB /  7982MiB |     84%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 2080    Off  | 00000000:68:00.0 Off |                  N/A |
| 25%   61C    P2   152W / 215W |   2494MiB /  7981MiB |     85%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     84541      C   nvidia-cuda-mps-server                      25MiB   |
|    0     88194      C   python3                                   2549MiB   |
|    1     84541      C   nvidia-cuda-mps-server                      25MiB   |
|    1     88194      C   python3                                   2463MiB   |
|    2     84541      C   nvidia-cuda-mps-server                      25MiB   |
|    2     88194      C   python3                                   2449MiB   |
|    3     84541      C   nvidia-cuda-mps-server                      25MiB   |
|    3     88194      C   python3                                   2457MiB   |
+-----------------------------------------------------------------------------+
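For context, the way I bring up the daemon before launching training looks roughly like this. The pipe/log directory paths are just placeholders I chose, not required locations, and the `command -v` guard is only there so the snippet is safe to run on a box without the MPS binaries:

```shell
# Sketch of my MPS startup. Directory names are placeholders I picked;
# clients must see the same CUDA_MPS_PIPE_DIRECTORY value as the daemon.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log    # control.log / server.log land here
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"

# Start the MPS control daemon in background mode (no-op if not installed).
if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
    nvidia-cuda-mps-control -d
fi
echo "pipe dir: $CUDA_MPS_PIPE_DIRECTORY"
```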

Switching to a single device also didn’t work; the output then looks like this:

Fri Mar 13 00:05:02 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:19:00.0 Off |                  N/A |
| 23%   34C    P2    49W / 215W |    900MiB /  7982MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2080    Off  | 00000000:1A:00.0 Off |                  N/A |
| 23%   36C    P8    21W / 215W |     35MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 2080    Off  | 00000000:67:00.0 Off |                  N/A |
| 23%   37C    P8     7W / 215W |     35MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 2080    Off  | 00000000:68:00.0 Off |                  N/A |
| 23%   42C    P8     8W / 215W |     35MiB /  7981MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     84541      C   nvidia-cuda-mps-server                      25MiB   |
|    0     91698      C   python3                                     865MiB  |
|    1     84541      C   nvidia-cuda-mps-server                      25MiB   |
|    2     84541      C   nvidia-cuda-mps-server                      25MiB   |
|    3     84541      C   nvidia-cuda-mps-server                      25MiB   |
+-----------------------------------------------------------------------------+
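For the single-device attempt, I restricted things roughly like this. My understanding from the MPS docs is that `CUDA_VISIBLE_DEVICES` has to be exported before the control daemon starts so the server only sees that GPU (happy to be corrected if that's wrong):

```shell
# Restrict MPS to GPU 0 only. The env vars must be set in the shell that
# starts the daemon AND in the shell that launches the training process.
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps   # placeholder path
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log    # placeholder path

if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
    echo quit | nvidia-cuda-mps-control 2>/dev/null  # stop any old daemon first
    nvidia-cuda-mps-control -d
fi
# ...then launch training from this same environment, e.g.:
# python3 train.py   # hypothetical entry point, stands in for my script
```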

It looks like my process can never bind to the MPS server; the server log is also completely empty.
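The only checks I’ve done so far look roughly like this (the `control.log` / `server.log` file names are my assumption based on the MPS documentation, and `get_server_list` should print the PIDs of any running MPS servers):

```shell
# Poke at the MPS logs and the control daemon. Log dir falls back to a
# placeholder default if CUDA_MPS_LOG_DIRECTORY isn't set.
LOG_DIR=${CUDA_MPS_LOG_DIRECTORY:-/tmp/nvidia-log}

for f in "$LOG_DIR/control.log" "$LOG_DIR/server.log"; do
    if [ -f "$f" ]; then
        echo "== $f =="
        tail -n 20 "$f"
    else
        echo "missing: $f"
    fi
done

# Ask the daemon directly which servers it knows about, if it is running.
if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
    echo get_server_list | nvidia-cuda-mps-control
fi
```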

Is there any way to debug this?

Thanks