problem with multi gpu using mpi

Hi

I am trying to implement multi-GPU support for my program using MPI.
However, on my host with two GPU cards I see the following output of nvidia-smi that seems to indicate that one process accessing both cards instead of each process accessing only one card.

Mon Nov 30 14:05:55 2015
+------------------------------------------------------+
| NVIDIA-SMI 340.29     Driver Version: 340.29         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         Off  | 0000:0A:00.0     Off |                    0 |
| N/A   37C    P0    73W / 235W |     98MiB /  5759MiB |     32%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20Xm         Off  | 0000:0D:00.0     Off |                    0 |
| N/A   35C    P0    75W / 235W |    169MiB /  5759MiB |     52%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0      1431  ../../../../bin/mytest                                  81MiB |
|    1      1431  ../../../../bin/mytest                                  69MiB |
|    1      1430  ../../../../bin/mytest                                  81MiB |
+-----------------------------------------------------------------------------+

However, when I look at the output of the following code, the assigning of the different cards seems to be correct, i.e one process gets device_num 0 and the other gets device_num 1.


    int acc_dev_id = -1;
    acc_device_t dev_type = acc_get_device_type();
    int num_devs = acc_get_num_devices(dev_type);
    MPI::COMM_WORLD.Gather(&num_devs,1,MPI::INT,recv_num_devs,1,MPI::INT,0);
/*
some code to read out recv_num_devs and fill array proc_accelerator_id with the proper device num for each rank taking into account that processes could be on the same host or on different nodes each with one or more gpus attached
*/

MPI::COMM_WORLD.Scatter(&proc_accelerator_id[0],1,MPI::INT,&acc_dev_id,1,MPI::INT,0);
acc_set_device_num(acc_dev_id,dev_type);
std::cout << "OpenACC device type for process " << pid << ": " 
    << dev_type << "\n";
acc_init(dev_type);
 std::cout << "OpenACC device number that will be used for process "
        << pid << ": " << acc_get_device_num(dev_type) << "\n";

So I am wondering what I am doing wrong here. Any suggestions where to look at?

Thanks,
LS

Hi LS,

In order to maintain interoperability with CUDA, we need to check if a CUDA context has already been created. If so, then we attach to this context. The problem being that starting in CUDA 7.0, there’s a default context. This causes the side-effect of the OpenACC runtime attaching to it, thus showing up as this extra context on the default device.

It’s relatively benign and I doubt it’s causing you any issues. However, I went ahead and opened a problem report (TPR#22133) since it shouldn’t be occurring and does take up a bit of space.

Best regards,
Mat

Hi Mat

thanks for the background. The reason I wanted to clarify this issue was that I got wrong results for my multi-gpu program. One thought was that something could go wrong if a single process is computing things on two cards while it was only intended to interact with one. I wanted to exclude this error source before digging further into the code to find out what else could have gone wrong.
In this particular case I subdivide my domain into two subdomains and assign a different device to be used for each of them. Looking at the results it seems that the computation has only worked out for one subdomain, i.e. one process, while the other contains only zeros.

Thanks,
LS