MPI run failures when using the NVIDIA MPS Service on multi-GPU nodes

I am having a problem when running MPI codes with the NVIDIA MPS Service on multi-GPU nodes.

The system I am using has two K80 cards (four GPUs in total).

Basically, I first set the GPU mode to exclusive_process:

nvidia-smi -c 3
Then I start the MPS Service:

nvidia-cuda-mps-control -d
When I increase the number of processes and run my code, I get the following error:

all CUDA-capable devices are busy or unavailable
Here is an example:

This is my code:

#include <stdio.h>
#include <stdlib.h>
#include "cuda_runtime.h"
#include "mpi.h"
#define __SIZE__ 1024

int main(int argc, char **argv)
{

    cudaError_t cuda_err = cudaSuccess;
    void *dev_buf;

    MPI_Init(&argc, &argv);

    int my_rank = -1;
    int dev_cnt = 0;
    int dev_id = -1;

    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    cuda_err = cudaGetDeviceCount(&dev_cnt);
    if (cuda_err != cudaSuccess)
        printf("cudaGET Error--on rank %d %s\n", my_rank, cudaGetErrorString(cuda_err));

    dev_id = my_rank % dev_cnt;

    printf("myrank=%d dev_cnt=%d, dev_id=%d\n", my_rank, dev_cnt, dev_id);

    cuda_err = cudaSetDevice(dev_id);
    if (cuda_err != cudaSuccess)
        printf("cudaSet Error--on rank %d %s\n", my_rank, cudaGetErrorString(cuda_err));

    cuda_err = cudaMalloc((void **) &dev_buf, __SIZE__);
    if (cuda_err != cudaSuccess)
        printf("cudaMalloc Error--on rank %d %s\n", my_rank, cudaGetErrorString(cuda_err));
    else
        printf("cudaMalloc Success++, %d \n", my_rank);


    MPI_Finalize();
    return 0;
}

Here is the output for 12 processes:

#mpirun -n 12 -hostfile hosts ./hq_test

myrank=0 dev_cnt=4, dev_id=0
myrank=1 dev_cnt=4, dev_id=1
myrank=2 dev_cnt=4, dev_id=2
myrank=3 dev_cnt=4, dev_id=3
myrank=4 dev_cnt=4, dev_id=0
myrank=5 dev_cnt=4, dev_id=1
myrank=6 dev_cnt=4, dev_id=2
myrank=7 dev_cnt=4, dev_id=3
myrank=8 dev_cnt=4, dev_id=0
myrank=9 dev_cnt=4, dev_id=1
myrank=10 dev_cnt=4, dev_id=2
myrank=11 dev_cnt=4, dev_id=3
cudaMalloc Success++, 8
cudaMalloc Success++, 10
cudaMalloc Success++, 0
cudaMalloc Success++, 1
cudaMalloc Success++, 3
cudaMalloc Success++, 7
cudaMalloc Success++, 9
cudaMalloc Success++, 6
cudaMalloc Success++, 4
cudaMalloc Success++, 2
cudaMalloc Success++, 5
cudaMalloc Success++, 11
Here is the output for 14 processes:

#mpirun -n 14 -hostfile hosts ./hq_test

myrank=0 dev_cnt=4, dev_id=0
myrank=1 dev_cnt=4, dev_id=1
myrank=2 dev_cnt=4, dev_id=2
myrank=3 dev_cnt=4, dev_id=3
myrank=4 dev_cnt=4, dev_id=0
myrank=5 dev_cnt=4, dev_id=1
myrank=6 dev_cnt=4, dev_id=2
myrank=7 dev_cnt=4, dev_id=3
myrank=8 dev_cnt=4, dev_id=0
myrank=9 dev_cnt=4, dev_id=1
myrank=10 dev_cnt=4, dev_id=2
myrank=11 dev_cnt=4, dev_id=3
myrank=12 dev_cnt=4, dev_id=0
myrank=13 dev_cnt=4, dev_id=1
cudaMalloc Success++, 11
cudaMalloc Success++, 3
cudaMalloc Success++, 7
cudaMalloc Success++, 2
cudaMalloc Success++, 10
cudaMalloc Success++, 6
cudaMalloc Success++, 1
cudaMalloc Success++, 8
cudaMalloc Error--on rank 13 all CUDA-capable devices are busy or unavailable
cudaMalloc Error--on rank 5 all CUDA-capable devices are busy or unavailable
cudaMalloc Error--on rank 9 all CUDA-capable devices are busy or unavailable
cudaMalloc Error--on rank 4 all CUDA-capable devices are busy or unavailable
cudaMalloc Error--on rank 0 all CUDA-capable devices are busy or unavailable
cudaMalloc Error--on rank 12 all CUDA-capable devices are busy or unavailable
Note: I have already tried changing the CUDA_DEVICE_MAX_CONNECTIONS value, but it didn’t help.

I’d appreciate it if you could share your thoughts on this.

What do you have your CUDA_VISIBLE_DEVICES environment variable set to (if anything) before launching the server daemon?

Please provide the MPS server log for the above failing case.

Hi there, and thanks for your reply.

Here is the script used to start the MPS server. I have replaced the actual UUIDs with UUID2 through UUID5 here.

export CUDA_VISIBLE_DEVICES=GPU-UUID2#,GPU-UUID3#,GPU-UUID4#,GPU-UUID5#
export CUDA_MPS_PIPE_DIRECTORY=/home/imfaraji/tmp/pipe3
export CUDA_MPS_LOG_DIRECTORY=/home/imfaraji/tmp/log3

nvidia-cuda-mps-control -d

Here is the output of nvidia-smi:

Fri Sep 16 08:32:06 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:07:00.0     Off |                    0 |
| N/A   32C    P8    26W / 149W |     22MiB / 11519MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:08:00.0     Off |                    0 |
| N/A   26C    P8    28W / 149W |     22MiB / 11519MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:86:00.0     Off |                  Off |
| N/A   23C    P8    26W / 149W |    115MiB / 12287MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:87:00.0     Off |                  Off |
| N/A   31C    P8    28W / 149W |    115MiB / 12287MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:8A:00.0     Off |                  Off |
| N/A   25C    P8    26W / 149W |    115MiB / 12287MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 0000:8B:00.0     Off |                  Off |
| N/A   34C    P8    28W / 149W |    115MiB / 12287MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    2      5659    C   nvidia-cuda-mps-server                          89MiB |
|    3      5659    C   nvidia-cuda-mps-server                          89MiB |
|    4      5659    C   nvidia-cuda-mps-server                          89MiB |
|    5      5659    C   nvidia-cuda-mps-server                          89MiB |
+-----------------------------------------------------------------------------+

Here is the output of server.log:

[2016-09-16 08:28:16.870 Other  5659] Start
[2016-09-16 08:28:16.870 Other  5659] Warning: File descriptor limit may be set too low, consider increasing it
[2016-09-16 08:28:18.454 Other  5659] New client 5644 connected
[2016-09-16 08:28:18.455 Other  5659] New client 5634 connected
[2016-09-16 08:28:18.455 Other  5659] New client 5639 connected
[2016-09-16 08:28:18.455 Other  5659] New client 5638 connected
[2016-09-16 08:28:18.455 Other  5659] New client 5643 connected
[2016-09-16 08:28:18.455 Other  5659] New client 5641 connected
[2016-09-16 08:28:18.455 Other  5659] New client 5636 connected
[2016-09-16 08:28:18.456 Other  5659] New client 5640 connected
[2016-09-16 08:28:18.456 Other  5659] New client 5637 connected
[2016-09-16 08:28:18.456 Other  5659] New client 5632 connected
[2016-09-16 08:28:18.456 Other  5659] New client 5635 connected
[2016-09-16 08:28:18.456 Other  5659] New client 5642 connected
[2016-09-16 08:28:18.456 Other  5659] New client 5631 connected
[2016-09-16 08:28:18.457 Other  5659] New client 5633 connected
[2016-09-16 08:28:18.968 Other  5659] MPS Server failed to create/open SHM segment.
[2016-09-16 08:28:18.968 Other  5659] MPS Server failed to create/open SHM segment.
[2016-09-16 08:28:18.968 Other  5659] MPS Server failed to create/open SHM segment.
[2016-09-16 08:28:18.969 Other  5659] MPS Server failed to create/open SHM segment.
[2016-09-16 08:28:18.969 Other  5659] MPS Server failed to create/open SHM segment.
[2016-09-16 08:28:18.969 Other  5659] MPS Server failed to create/open SHM segment.
[2016-09-16 08:28:18.969 Other  5659] MPS Server failed to create/open SHM segment.
[2016-09-16 08:28:19.005 Other  5659] Client 5631 disconnected
[2016-09-16 08:28:19.008 Other  5659] Client 5635 disconnected
[2016-09-16 08:28:19.011 Other  5659] Client 5636 disconnected
[2016-09-16 08:28:19.013 Other  5659] Client 5640 disconnected
[2016-09-16 08:28:19.015 Other  5659] Client 5643 disconnected
[2016-09-16 08:28:19.018 Other  5659] Client 5644 disconnected
[2016-09-16 08:28:19.022 Other  5659] Client 5639 disconnected
[2016-09-16 08:28:19.053 Other  5659] Client 5632 disconnected
[2016-09-16 08:28:19.055 Other  5659] Client 5638 disconnected
[2016-09-16 08:28:19.055 Other  5659] Client 5641 disconnected
[2016-09-16 08:28:19.063 Other  5659] Client 5642 disconnected
[2016-09-16 08:28:19.064 Other  5659] Client 5633 disconnected
[2016-09-16 08:28:19.071 Other  5659] Client 5634 disconnected
[2016-09-16 08:28:19.071 Other  5659] Client 5637 disconnected

Have you tried addressing this warning from the server log:

“Warning: File descriptor limit may be set too low, consider increasing it”

If so, did it have any effect on the observed behavior?

Regarding this:

MPS Server failed to create/open SHM segment.

Please read the CUDA MPS documentation:

https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

section 4.4:

“Memory allocation API calls (including context creation) may fail with the following
message in the server log: MPS Server failed to create/open SHM segment.
Comments: This is most likely due to exhausting the file descriptor limit on your
system. Check the maximum number of open file descriptors allowed on your
system and increase if necessary. We recommend setting it to 16384 and higher.
Typically this information can be checked via the command ‘ulimit -n’; refer to your
operating system instructions on how to change the limit.”