Hi Mat,
I have investigated a little further and found something strange.
After determining which accelerators are available to the processes running on the same node or on different nodes, I run the following code to assign each process its own device:
#ifdef _OPENACC
  // Rank 0 scatters the per-process accelerator IDs; each rank receives its own ID in acc_dev_id.
  MPI::COMM_WORLD.Scatter(&proc_accelerator_id[0], 1, MPI::INT, &acc_dev_id, 1, MPI::INT, 0);
  if(MPI::COMM_WORLD.Get_rank() == 0){
    LogFile << "Accelerator (GPU) IDs: ";
    for(size_t i = 0; i < proc_accelerator_id.size(); ++i){
      LogFile << proc_accelerator_id[i] << " ";
    }
    LogFile << "\n";
  }
  MPI::COMM_WORLD.Barrier();
  if(acc_dev_id < num_devs){
    // Bind this rank to its assigned device; the empty kernels region makes the
    // runtime attach to that device right away.
    acc_set_device_num(acc_dev_id, dev_type);
    #pragma acc kernels
    {}
#ifdef _DEF_PGI_ACCEL
    char device_name[32];
    acc_get_current_device_name(device_name, 32);
#endif
    std::cout << "Accelerator (GPU) device number that will be used for process "
              << pid << ": " << acc_get_device_num(dev_type)
#ifdef _DEF_PGI_ACCEL
              << " (" << device_name << ")"
#endif
              << "\n";
  }else{
    EMS_Error("Error: Could not set accelerator device ID");
  }
  if(MPI::COMM_WORLD.Get_rank() == 0){
    check_openacc_supported_input();
  }
  //acc_init(dev_type);
  MPI::COMM_WORLD.Barrier();
#endif
This is the only place in the entire code where I use acc_set_device_num().
The output is:
<hostname> <#procs per host> (<pid1>, <pid2>, ... ) <#accelerators per host>
tesla-dell-lnx 2 (143028, 143029) 2
Accelerator (GPU) IDs: 0 1
Accelerator (GPU) device number that will be used for process 143028: 0 (Tesla V100-PCIE-32GB)
Accelerator (GPU) device number that will be used for process 143029: 1 (Tesla P100-PCIE-16GB)
When I look at the nvidia-smi output, I see:
Thu Nov 8 16:53:49 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26 Driver Version: 396.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:04:00.0 Off | 0 |
| N/A 30C P0 38W / 250W | 2479MiB / 32510MiB | 12% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... On | 00000000:83:00.0 Off | 0 |
| N/A 29C P0 32W / 250W | 305MiB / 16280MiB | 35% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 143028 C ../../../../bin/s-gpu 424MiB |
| 0 143029 C ../../../../bin/s-gpu 410MiB |
| 1 143029 C ../../../../bin/s-gpu 295MiB |
+-----------------------------------------------------------------------------+
So nvidia-smi indicates that the MPI process with rank 1 (PID 143029) is actually running on, or at least has a context on, both GPUs!
The present-table dumps at the beginning seem to indicate that everything is as it should be, i.e. rank 0 dumps a table for GPU 0 and rank 1 dumps a table for GPU 1:
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 7.0, threadid=1
Present table dump for device[2]: NVIDIA Tesla GPU 1, compute capability 6.0, threadid=1
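(For reference, those dumps are produced by something roughly like the sketch below, right after the data is placed on the device; the array name data and its size n are placeholders, not the actual variables in the code, and the real data directives may differ.)
// Simplified sketch with placeholder names, issued once per rank after device selection:
#pragma acc enter data copyin(data[0:n])
#ifdef _DEF_PGI_ACCEL
acc_present_dump();   // prints this rank's present table
#endif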
However, at the end of the program, when I call acc update self, both ranks give the same output from acc_present_dump(), i.e. they both refer to the same device:
Present table dump for device[0]: , threadid=0
...empty...
I have also printed the addresses of the host arrays on both ranks, and they differ. In addition, I wrapped the acc update self in an if(rank==0) or if(rank==1) statement to verify that the two ranks really do execute this statement separately; they still produce the same output (see the sketch below).
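The end-of-program check looks roughly like this (again a simplified sketch with the placeholder names data and n; the real code uses the actual field arrays):
// Simplified end-of-program check: only rank 1 executes the update here
// (an equivalent block guarded by rank 0 was tested as well), yet both
// ranks print the same empty present table.
if(MPI::COMM_WORLD.Get_rank() == 1){
  std::cout << "Rank 1 host address: " << static_cast<void*>(data) << "\n";
  #pragma acc update self(data[0:n])
#ifdef _DEF_PGI_ACCEL
  acc_present_dump();
#endif
}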
I am puzzled as to how the process-to-device binding can change during execution when I do not call acc_set_device_num() more than once per rank.
Any thoughts? Or do you think it is time for you to take a look into the code?
Thanks,
LS