Machine Specs:
GPU: NVIDIA Grace Hopper (GH100 with 97 GB of memory)
Driver Version: 560.35.03
CUDA Version: 12.6
Issue Description:
I am running a script that starts an MPS server, setting the environment variables CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY to /tmp/pipe/<timestamp> and /tmp/log/<timestamp> respectively, so that each MPS server pipes and logs to its own unique directory.
After launching the MPS server, the script starts 25 clients that use MPS.
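For context, here is a simplified sketch of how the script brings up one server instance and its clients (the client binary name is just a placeholder):

```python
# Simplified sketch of the per-run launch logic; "./client_app" is a placeholder.
import os
import subprocess
import time

def start_mps_instance(num_clients: int, client_cmd: list[str]) -> None:
    ts = str(int(time.time()))
    pipe_dir = f"/tmp/pipe/{ts}"
    log_dir = f"/tmp/log/{ts}"
    os.makedirs(pipe_dir, exist_ok=True)
    os.makedirs(log_dir, exist_ok=True)

    env = os.environ.copy()
    env["CUDA_MPS_PIPE_DIRECTORY"] = pipe_dir
    env["CUDA_MPS_LOG_DIRECTORY"] = log_dir

    # Start the MPS control daemon for this instance; the server itself is
    # spawned when the first client connects through this pipe directory.
    subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=True)

    # Launch the clients with the same pipe directory so they attach to
    # this server instance and not to the other one.
    procs = [subprocess.Popen(client_cmd, env=env) for _ in range(num_clients)]
    for p in procs:
        p.wait()

if __name__ == "__main__":
    start_mps_instance(num_clients=25, client_cmd=["./client_app"])
```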
The issue arises when I run the script twice, launching two MPS servers with 30 clients each (two MPS servers and 60 clients in total). At that point I hit the following error:
cudaMemcpy failed: the remote procedural call between the MPS server and the MPS client failed (error code 806, cudaErrorMpsRpcFailure)
I have verified that GPU memory is not exhausted. The error only occurs when running two scripts (two MPS servers), and it persists even when I reduce the number of clients per MPS server.
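For reference, this is roughly how I check memory headroom while the clients are running (a simplified sketch using nvidia-smi):

```python
# Sketch of the memory check I run while the clients are active.
import subprocess

def gpu_memory_usage() -> list[tuple[int, int]]:
    """Return (used_MiB, total_MiB) per GPU as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(int(v) for v in line.split(","))
            for line in out.strip().splitlines()]

print(gpu_memory_usage())  # prints (used, total) in MiB per GPU
```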
Questions:
- Is there any known limitation or conflict when running multiple MPS servers? (I know the documented limit is 48 client CUDA contexts per device per MPS server.)
- Could this error be caused by how CUDA manages multiple MPS instances?
- Are there any debugging steps or configurations I should check? (A sketch of how I currently query each control daemon is below.)
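Regarding the last question, this is how I currently poke at each control daemon to confirm it is alive and see which server PID it manages (a rough sketch; the timestamp paths are illustrative):

```python
# Sketch of querying each MPS control daemon; the pipe paths are made up.
import os
import subprocess

def mps_control(pipe_dir: str, command: str) -> str:
    """Send a single command to the MPS control daemon behind pipe_dir."""
    env = os.environ.copy()
    env["CUDA_MPS_PIPE_DIRECTORY"] = pipe_dir
    result = subprocess.run(
        ["nvidia-cuda-mps-control"],
        input=command + "\n", env=env,
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Illustrative timestamps; in practice these come from the launch script.
for pipe_dir in ("/tmp/pipe/1718000000", "/tmp/pipe/1718000123"):
    print(pipe_dir, "->", mps_control(pipe_dir, "get_server_list"))
```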
Any insights would be greatly appreciated. Thanks in advance!