MPS server works with single-node multi-GPU but not with two-node multi-GPU

Hello,
I am working on developing a large GPU-enabled code that benefits from running several MPI processes per GPU. Inspired by the following lecture, I want to use MPS so that GPU kernels launched from the different MPI processes sharing each GPU can run concurrently. I have read the MPS documentation and followed the procedure for a single-user system (section 5.1.2). First, I tried to run 4 MPI processes per GPU on a single node (a total of 32 MPI processes across 8 GPUs).

To do so, I logged in as root and changed the compute mode of each GPU to EXCLUSIVE_PROCESS (repeating the command below with the index varied from 0 to 7 to cover all 8 GPUs in the node):

nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
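
For completeness, the same setting could be applied to all eight GPUs with a small loop (shown only as a sketch; I actually ran the command once per index):

for i in $(seq 0 7); do
    nvidia-smi -i $i -c EXCLUSIVE_PROCESS
done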

Then, in a new session (terminal window), logged in to the node as a regular user, I exported the paths for the MPS pipes and logs as follows:

export CUDA_MPS_PIPE_DIRECTORY=/home/username/mps_logs/pgpu02
export CUDA_MPS_LOG_DIRECTORY=/home/username/mps_logs/pgpu02
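
(For reference, a per-node folder like this can be created beforehand with, e.g.:

mkdir -p /home/username/mps_logs/pgpu02
)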

Then, I started the MPS control daemon in the background by executing:

nvidia-cuda-mps-control -d

And started the MPS server by opening the interactive control prompt and issuing start_server with my user ID:

nvidia-cuda-mps-control
start_server -uid 1003
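
For reference, the same can also be done non-interactively by piping commands to the control daemon, and the running servers can be listed the same way (shown as an alternative form of what I did above):

echo "start_server -uid 1003" | nvidia-cuda-mps-control
echo "get_server_list" | nvidia-cuda-mps-control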

The code was executed via the following line:

mpirun --bind-to none -np 32 sh gpu_script_rank.sh executable_file input_file.inp

Where the script makes each MPI process see only one GPU, preventing every process from allocating redundant context memory on GPU #0. Just in case, the script content is below:

#!/bin/bash
# Map each local MPI rank to a single GPU via CUDA_VISIBLE_DEVICES.
ngpus=8
if [[ -n ${OMPI_COMM_WORLD_LOCAL_RANK} ]]
then
    lrank=${OMPI_COMM_WORLD_LOCAL_RANK}
    device=$((lrank % ngpus))
    export CUDA_VISIBLE_DEVICES=$device
fi
echo $lrank $device $CUDA_VISIBLE_DEVICES
echo "$@"
# Launch the actual executable with its arguments
"$@"

As you may have noticed up to this point, I have not assigned any value to CUDA_VISIBLE_DEVICES before starting the MPS server, because I want all GPUs in the system to be available to it.

The result of the execution is shown in the following nvidia-smi output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A5000               On  | 00000000:01:00.0 Off |                  Off |
| 30%   38C    P8              23W / 230W |   1245MiB / 24564MiB |      1%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000               On  | 00000000:25:00.0 Off |                  Off |
| 30%   39C    P8              22W / 230W |   1255MiB / 24564MiB |      2%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000               On  | 00000000:41:00.0 Off |                  Off |
| 30%   38C    P8              25W / 230W |   1245MiB / 24564MiB |      3%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000               On  | 00000000:61:00.0 Off |                  Off |
| 30%   37C    P8              20W / 230W |   1247MiB / 24564MiB |      3%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A5000               On  | 00000000:81:00.0 Off |                  Off |
| 30%   38C    P8              22W / 230W |   1245MiB / 24564MiB |      2%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A5000               On  | 00000000:A1:00.0 Off |                  Off |
| 30%   38C    P8              27W / 230W |   1247MiB / 24564MiB |      1%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA RTX A5000               On  | 00000000:C1:00.0 Off |                  Off |
| 30%   37C    P8              18W / 230W |   1255MiB / 24564MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA RTX A5000               On  | 00000000:E1:00.0 Off |                  Off |
| 30%   37C    P8              22W / 230W |   1247MiB / 24564MiB |      1%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    283318      C   nvidia-cuda-mps-server                       28MiB |
|    0   N/A  N/A    283392    M+C   .../executable_file                         258MiB |
|    0   N/A  N/A    283403    M+C   .../executable_file                         258MiB |
|    0   N/A  N/A    283421    M+C   .../executable_file                         258MiB |
|    0   N/A  N/A    283440    M+C   .../executable_file                         258MiB |
|    1   N/A  N/A    283318      C   nvidia-cuda-mps-server                       28MiB |
|    1   N/A  N/A    283394    M+C   .../executable_file                         258MiB |
|    1   N/A  N/A    283408    M+C   .../executable_file                         258MiB |
|    1   N/A  N/A    283425    M+C   .../executable_file                         258MiB |
|    1   N/A  N/A    283444    M+C   .../executable_file                         258MiB |
|    2   N/A  N/A    283318      C   nvidia-cuda-mps-server                       28MiB |
|    2   N/A  N/A    283391    M+C   .../executable_file                         258MiB |
|    2   N/A  N/A    283407    M+C   .../executable_file                         258MiB |
|    2   N/A  N/A    283423    M+C   .../executable_file                         258MiB |
|    2   N/A  N/A    283442    M+C   .../executable_file                         258MiB |
|    3   N/A  N/A    283318      C   nvidia-cuda-mps-server                       28MiB |
|    3   N/A  N/A    283396    M+C   .../executable_file                         258MiB |
|    3   N/A  N/A    283413    M+C   .../executable_file                         258MiB |
|    3   N/A  N/A    283428    M+C   .../executable_file                         258MiB |
|    3   N/A  N/A    283450    M+C   .../executable_file                         258MiB |
|    4   N/A  N/A    283318      C   nvidia-cuda-mps-server                       28MiB |
|    4   N/A  N/A    283398    M+C   .../executable_file                         258MiB |
|    4   N/A  N/A    283411    M+C   .../executable_file                         258MiB |
|    4   N/A  N/A    283429    M+C   .../executable_file                         258MiB |
|    4   N/A  N/A    283453    M+C   .../executable_file                         258MiB |
|    5   N/A  N/A    283318      C   nvidia-cuda-mps-server                       28MiB |
|    5   N/A  N/A    283402    M+C   .../executable_file                         258MiB |
|    5   N/A  N/A    283416    M+C   .../executable_file                         258MiB |
|    5   N/A  N/A    283435    M+C   .../executable_file                         258MiB |
|    5   N/A  N/A    283455    M+C   .../executable_file                         258MiB |
|    6   N/A  N/A    283318      C   nvidia-cuda-mps-server                       28MiB |
|    6   N/A  N/A    283400    M+C   .../executable_file                         258MiB |
|    6   N/A  N/A    283417    M+C   .../executable_file                         258MiB |
|    6   N/A  N/A    283431    M+C   .../executable_file                         258MiB |
|    6   N/A  N/A    283452    M+C   .../executable_file                         258MiB |
|    7   N/A  N/A    283318      C   nvidia-cuda-mps-server                       28MiB |
|    7   N/A  N/A    283405    M+C   .../executable_file                         258MiB |
|    7   N/A  N/A    283419    M+C   .../executable_file                         258MiB |
|    7   N/A  N/A    283438    M+C   .../executable_file                         258MiB |
|    7   N/A  N/A    283458    M+C   .../executable_file                         258MiB |
+---------------------------------------------------------------------------------------+

The MPS server works perfectly for the single-node use case, and I am very happy with the performance improvement it offers. Now I want to use it for a 2-node calculation, where both nodes have identical specs. To enable the MPS daemon on each node, I created a separate log folder per node and exported it while logged in to that node. Then I launched the MPS daemon on each node as explained above and started an MPS server, which showed up in the nvidia-smi process list on each node. The GPUs in both nodes were set to exclusive process mode.
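
For clarity, the per-node exports looked like this (a sketch; each pair was done while logged in to the respective node):

# on pgpu02
export CUDA_MPS_PIPE_DIRECTORY=/home/username/mps_logs/pgpu02
export CUDA_MPS_LOG_DIRECTORY=/home/username/mps_logs/pgpu02

# on pgpu03
export CUDA_MPS_PIPE_DIRECTORY=/home/username/mps_logs/pgpu03
export CUDA_MPS_LOG_DIRECTORY=/home/username/mps_logs/pgpu03

To run the code, I have to be on the master node of the cluster, and I ran the following command: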

mpirun --bind-to none -N 16 -x UCX_ERROR_SIGNALS="" -hostfile hostfile_gpu sh gpu_script_rank.sh executable_file input_file.inp

Where the hostfile content is as simple as:

pgpu02
pgpu03

This execution crashed, and the following error message appeared shortly after the code started:

Failing in Thread:1
Accelerator Fatal Error: call to cuCtxCreate returned error 46: Other

The message was duplicated 32 times (the same as the number of MPI processes).
I was able to reproduce the same error by running a single-node calculation from a new session, and in that case it was solved by exporting the log directories as shown above. This hints that the problem is that the master node does not know where the MPS server pipes and logs are located. I tried exporting the paths in various ways, including the combined form shown below:

export CUDA_MPS_PIPE_DIRECTORY=/home/username/mps_logs/pgpu02:/home/username/mps_logs/pgpu03
export CUDA_MPS_LOG_DIRECTORY=/home/username/mps_logs/pgpu02:/home/username/mps_logs/pgpu03

But the error stayed the same. I also tried setting the GPUs of one node to the default compute mode and exporting only the other node's path, but the calculation still crashed with the same error (though the number of duplicated messages was halved). When I set the compute mode to default on all GPUs, the code runs fine, but it does not use the launched MPS server, i.e. I do not see the performance improvement I observed for the single-node execution. Below is the nvidia-smi output for that case:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A5000               On  | 00000000:01:00.0 Off |                  Off |
| 30%   40C    P2              61W / 230W |    554MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000               On  | 00000000:25:00.0 Off |                  Off |
| 30%   41C    P2              59W / 230W |    554MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000               On  | 00000000:41:00.0 Off |                  Off |
| 30%   41C    P2              62W / 230W |    554MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000               On  | 00000000:61:00.0 Off |                  Off |
| 30%   39C    P2              57W / 230W |    554MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A5000               On  | 00000000:81:00.0 Off |                  Off |
| 30%   40C    P2              60W / 230W |    554MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A5000               On  | 00000000:A1:00.0 Off |                  Off |
| 30%   41C    P2              68W / 230W |    554MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA RTX A5000               On  | 00000000:C1:00.0 Off |                  Off |
| 30%   39C    P2              57W / 230W |    554MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA RTX A5000               On  | 00000000:E1:00.0 Off |                  Off |
| 30%   39C    P2              59W / 230W |    554MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    287261      C   nvidia-cuda-mps-server                       28MiB |
|    0   N/A  N/A    287403      C   ...executable_file                          258MiB |
|    0   N/A  N/A    287419      C   ...executable_file                          258MiB |
|    1   N/A  N/A    287261      C   nvidia-cuda-mps-server                       28MiB |
|    1   N/A  N/A    287406      C   ...executable_file                          258MiB |
|    1   N/A  N/A    287421      C   ...executable_file                          258MiB |
|    2   N/A  N/A    287261      C   nvidia-cuda-mps-server                       28MiB |
|    2   N/A  N/A    287408      C   ...executable_file                          258MiB |
|    2   N/A  N/A    287423      C   ...executable_file                          258MiB |
|    3   N/A  N/A    287261      C   nvidia-cuda-mps-server                       28MiB |
|    3   N/A  N/A    287409      C   ...executable_file                          258MiB |
|    3   N/A  N/A    287426      C   ...executable_file                          258MiB |
|    4   N/A  N/A    287261      C   nvidia-cuda-mps-server                       28MiB |
|    4   N/A  N/A    287411      C   ...executable_file                          258MiB |
|    4   N/A  N/A    287428      C   ...executable_file                          258MiB |
|    5   N/A  N/A    287261      C   nvidia-cuda-mps-server                       28MiB |
|    5   N/A  N/A    287413      C   ...executable_file                          258MiB |
|    5   N/A  N/A    287430      C   ...executable_file                          258MiB |
|    6   N/A  N/A    287261      C   nvidia-cuda-mps-server                       28MiB |
|    6   N/A  N/A    287415      C   ...executable_file                          258MiB |
|    6   N/A  N/A    287432      C   ...executable_file                          258MiB |
|    7   N/A  N/A    287261      C   nvidia-cuda-mps-server                       28MiB |
|    7   N/A  N/A    287417      C   ...executable_file                          258MiB |
|    7   N/A  N/A    287435      C   ...executable_file                          258MiB |
+---------------------------------------------------------------------------------------+

As you can see, the process type here is “C” for all processes, while in the working single-node case it was “M+C” for the processes created by my code. The compiler I use is NVIDIA HPC SDK 24.1, and my code is written in Fortran. For MPI, I use the OpenMPI 4.1.7 build that ships with the SDK's bundled HPC-X. Below, I attach the server and control logs for the 3 tested scenarios.
1_node_working___control.log (15.3 KB)
1_node_working___server.log (30.3 KB)
2_nodes_not_working___control.log (660 Bytes)
2_nodes_not_working___server.log (1.3 KB)
2_nodes_working_but_not_using_mps___control.log (933 Bytes)
2_nodes_working_but_not_using_mps___server.log (1.3 KB)

My question is: how can I set up and use the MPS server for a multi-node, multi-GPU scenario, given what I have tried so far?

P.S. I noticed a topic on a similar matter here, where the person was using the same GPU driver version as shown in my nvidia-smi output, but if possible I would prefer not to upgrade the driver unless this is a confirmed bug that can only be solved that way.