Problem with porting the code to multiple GPUs

Hello. I have a reproducing example here: https://github.com/AndStorm/QUESTION.git .

It is straightforward. At the beginning, N particles are created in the array particles using the make_particle() function, then propagator() changes the particle fields x, ir and rs, and compressor() sorts the alive particles and kills the dead ones so that they are in descending order by ir.

The code works properly on 1 GPU (GeForce 650 Ti) using the compile line:

cmake . -DCMAKE_C_COMPILER=pgcc -DCMAKE_CXX_COMPILER=pgc++ -DCMAKE_CXX_FLAGS="-acc -Minfo=all -mcmodel=medium -ta=tesla:cc30 -Mcuda=cuda10.1"

But new nodes will be installed on my company's cluster: 1 Intel CPU + 2 NVIDIA Tesla V100 SXM2 32 GB GPUs, with the GPUs connected via NVLink. I have to make this reproducing example work effectively in multi-GPU mode on these nodes (driven from 1 CPU, running simultaneously on the 2 Tesla V100 GPUs), if that is possible.

I have some questions. Could you be so kind as to answer them, please?

  1. The array particles is created on the GPU on line 260. If there are 2 GPUs, should each of them have its own copy of this array, or do NVLink, GPUDirect with peer-to-peer transfers (https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf) or CUDA Unified Memory allow some memory to be shared between the 2 GPUs?

  2. When I call the make_particle() function, I can write

    acc_set_device_num(gpuid, acc_device_nvidia);

    #pragma acc parallel num_gangs(1) vector_length(1) present(particles[0:GL])

where gpuid is 0 or 1, so the OpenACC compute construct will be executed on the GPU with this id. But if gpuid=0, should I update the data on the GPU with gpuid=1 after this code has run, and how do I update it?

  3. Finally, the function compressor() on lines 127-252 is the most compute-intensive function in the whole program, and I don't understand how to parallelize it across 2 GPUs. I could probably divide all the arrays into 2 parts (one for the first GPU, another for the second) and try to sort the 1st half of the array particles on GPU0 and the 2nd half on GPU1. But if, for example, I sorted the 1st half on GPU0, should I then update this part of the array particles on GPU1?

Could you please tell me if you see any way to offload the compressor() function to 2 GPUs simultaneously?

I have searched the Internet for examples of OpenACC usage for multi-GPU launches, but all I have found is a very simple Mandelbrot example. My code is more complicated and I don't know what to do.

I'm assuming that at some point you are planning on adding host-side parallelization: MPI, OpenMP, or in the future OpenACC's "loop self" clause.

Data regions are per device, so "particles" would be created in the device memory that's currently selected: the default device, the one set via the environment variable ACC_DEVICE_NUM, or the one set via a call to "acc_set_device_num". If you have multiple devices, each device would need to get its own copy.
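For example, a minimal sketch of that per-device setup (assuming particles and GL as in your repo, plus #include <openacc.h> for the API routines): each pass through the loop selects a device and creates an independent copy of particles on it.

    int ngpus = acc_get_num_devices(acc_device_nvidia);
    for (int g = 0; g < ngpus; ++g) {
        /* select device g, then create a separate copy of particles on it */
        acc_set_device_num(g, acc_device_nvidia);
        #pragma acc enter data copyin(particles[0:GL])
    }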

You could add CUDA code to create a shared IPC memory buffer, or use NVSHMEM to share memory across devices, but pure OpenACC does not have this functionality since it would be device-specific.
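To give a flavor of the CUDA IPC route, here is a rough sketch (illustrative names like share_buffer and host_buf are mine, two MPI ranks assumed, error checking omitted): rank 0 allocates a device buffer and shares its IPC handle, rank 1 opens the handle, and both ranks then map the same device memory behind a host pointer with acc_map_data().

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <openacc.h>

    /* Rank 0 owns the device allocation; rank 1 opens it through CUDA IPC.
       Afterwards both ranks see the same device memory behind host_buf. */
    void share_buffer(double *host_buf, size_t n, int rank)
    {
        void *dev_buf = NULL;
        cudaIpcMemHandle_t handle;

        if (rank == 0) {
            cudaMalloc(&dev_buf, n * sizeof(double));
            cudaIpcGetMemHandle(&handle, dev_buf);
            MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else {
            MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            cudaIpcOpenMemHandle(&dev_buf, handle, cudaIpcMemLazyEnablePeerAccess);
        }

        /* tell the OpenACC runtime that host_buf already lives at dev_buf */
        acc_map_data(host_buf, dev_buf, n * sizeof(double));
    }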

But if gpuid=0, should I update the data on the GPU with gpuid=1 after this code has run, and how do I update it?

You need to keep in mind which device you're doing the compute on and make sure that the correct data is on the correct device. So when you put particles on a device, you want to first set the device number, copy the portion of particles that's to be computed on that device, then set the next device number and copy the next part of particles to the second device. If you put all of particles on both devices, you will have two complete and separate copies that are not coherent. That's fine if particles is read-only, but a problem if particles gets updated, since you then need to exchange data between the two devices in order to keep the memories coherent.
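Here's a rough sketch of that exchange, assuming both devices hold a full copy of particles (created as in the earlier snippet), a hypothetical step_on_two_gpus() helper, and an assumed particle layout (the real one is in your repo). Each device computes one half, pulls its half back to the host, and then receives the other device's half with an update device directive:

    /* assumed layout; the real struct is in the repo */
    typedef struct { double x, rs; int ir; } particle;

    void step_on_two_gpus(particle *particles, int GL)
    {
        const int half = GL / 2;

        for (int g = 0; g < 2; ++g) {
            acc_set_device_num(g, acc_device_nvidia);
            const int start = g * half;
            const int len   = (g == 1) ? GL - start : half;
            /* each device updates only its own half of its copy */
            #pragma acc parallel loop present(particles[0:GL])
            for (int i = start; i < start + len; ++i) {
                particles[i].x += 1.0;   /* stand-in for the real propagation */
            }
            /* bring this device's half back into the host copy */
            #pragma acc update self(particles[start:len])
        }

        /* push the half computed on the other device to each GPU,
           so the two device copies are coherent again */
        for (int g = 0; g < 2; ++g) {
            const int other = 1 - g;
            const int start = other * half;
            const int len   = (other == 1) ? GL - start : half;
            acc_set_device_num(g, acc_device_nvidia);
            #pragma acc update device(particles[start:len])
        }
    }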

Could you please tell me if you see any way to offload the compressor() function to 2 GPUs simultaneously?

My advice would be to parallelize your code with MPI. After that, it's fairly straightforward to have one MPI rank per device. The challenge in your code, though, is that it assumes particles is in a shared memory space as opposed to being distributed across multiple memories. If you can't readjust the algorithm so particles is distributed, then you might look at adding NVSHMEM (https://developer.nvidia.com/nvshmem) or creating an IPC shared memory space (https://github.com/rmfarber/ParallelProgrammingWithOpenACC/blob/master/Chapter05/mpi_ipc_acc_map_data.c) so particles can be shared across devices.
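For the rank-per-device binding itself, the usual pattern is small. This is only a sketch of the structure, not code from your repo:

    #include <mpi.h>
    #include <openacc.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* bind this rank to one of the node's GPUs */
        int ngpus = acc_get_num_devices(acc_device_nvidia);
        acc_set_device_num(rank % ngpus, acc_device_nvidia);

        /* each rank would then own roughly GL/nranks particles and run
           make_particle()/propagator()/compressor() on its own block,
           exchanging whatever data it needs with the other rank via MPI */

        MPI_Finalize();
        return 0;
    }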