Problem with porting the code to multiple GPUs

Hello. I have a reproducing example code here: .

It is straight and forward. In the beginning N particles are created within the array particles using make_particle() function, then propagator() changes the particle fields x, ir and rs and compressor() sorts the alive particles and kills the dead ones so that they are in the descending order by ir.

The code works properly on 1 GPU (GeForce 650 Ti) using the compile line:

cmake . –DCMAKE_C_COMPILER=pgcc –DCMAKE_CXX_COMPILER=pgc++ -DCMAKE_CXX_FLAGS=”-acc –Minfo=all –mcmodel=medium –ta=tesla:cc30 –Mcuda=cuda10.1”

But there will be installed new nodes on the cluster of my company: 1 Intel CPU + 2 NVIDIA Tesla V100 SXM2 32 GB GPUs. The GPUs are connected via NVLink. And I have to make this reproducing example code effectively work in multi-GPU mode on these nodes (from 1 CPU simultaneously on 2 Tesla V100 GPUs), if it is possible.

I have some questions. Could You be so kind to answer them, please.

  1. The array particles is created on GPU on line 260. If there are 2 GPUs, each of them should have its own copy of this array or NVlink or GPUDirect with peer-to-peer transfers ( or CUDA Unified memory allow to have some shared memory between 2 GPUs?

  2. When I call make_particle() function, I can write

    #pragma acc set_device_num(gpuid,acc_device_nvidia)

    #pragma acc parallel num_gangs(1) vector_length(1) present(particles[0:GL])

where gpuid=0/1, so the OpenAcc compute construct will be executed on the GPU with this id. But if gpuid=0, should I update the data on the GPU with gpuid=1 after this code ran and how to update it?

  1. Finally, the function compressor() on lines 127-252 is the most compute-intensive function of all the program and I don’t understand how to parallelize it between 2 GPUs. I could, probably, divide all the arrays in 2 parts (one for the first GPU, another for the second) and try to sort the 1st half of the array particles on GPU0 and the 2nd half – on GPU1. But if I, for example, sorted the 1st half on GPU0, should I update this part of the array particles on GPU1?

Could You, please, be so kind to answer if You see any ways how to offload the compressor() function to 2 GPUs simultaneously?

I have searched on the Internet for OpenAcc usage for Multi-GPU launch, but all I have found is a very simple Mandelbrot example. My code is more complicated and I don’t know what to do.

I’m assuming at some point you are planning on adding host side parallelization, MPI, OpenMP or in the future OpenACC’s “loop self” clause.

Data regions are per device, so “particles” would be created in the device memory that’s been set by default, via the environment variable ACC_DEVICE_NUM, or via a call to “acc_set_device_num”. If you have multiple devices, each device would need to get it’s own code.

You could add CUDA code to create an shared IPC memory buffer or use NVSHMEM to share memory accross devices, but pure OpenACC does not have this functionality since it would device specific.

.But if gpuid=0, should I update the data on the GPU with gpuid=1 after this code ran and how to update it?

You need to keep in mind which device your doing the compute and make sure that the correct data is on the correct device. So when you put particles on the device, you want to also set the device number, copy the portion of particles that’s to be computed on a device, then set the next device number and now copy the next part of particle to the second device. If you put all of particles on both devices, you will have two complete and separate copies that are not coherent. That’s fine it particles is read-only, but a problem if particles gets updated since you need to exchange data between the two devices in order to keep the memories coherent.

Could You, please, be so kind to answer if You see any ways how to offload the compressor() function to 2 GPUs simultaneously?

My advice would be to parallelize you’re code with MPI. After that, it’s fairly straight forward to have one MPI rank per device. Though the challenge in your code is that is assumes particles is in a shared memory spaces as opposed to being distributed across multiple memories. If you can’t readjust the algorithm so particles is distributed, then you might look at adding NVSHMEM ( or creating an IPC shared memory space ( so particles can be shared across devices.

Thank You very much for the answer!
I chose the simplest way out of all You advised - I tried to independently operate unique copies of the particles array per GPU (tried to readjust the algorithm so that the array particles is distributed the simplest way I could).
The updated code is here: .
I use the compile line:
-DCMAKE_CXX_FLAGS="-acc -Minfo=all -mcmodel=medium -
ta=tesla:cc30 -Mcuda=cuda10.1"
but the code fails with
Failing in Thread:1
call to cuMemFreeHost returned error 700: Illegal address during kernel

I wanted to make it work on 2 GPUs, but it doesn’t work even on a single one.
Could You be so kind to tell what is wrong with it and how to fix it?

The illegal address error is because you’re only copying in a single element of “LIFE” (as well as other “gpu” arrays) but accessing them as an array. Instead, you need to copy in the whole array:

#pragma acc parallel num_gangs(1) vector_length(1) present(particles[0:GL]) copy(LIFE[:NUMBER_OF_GPUS],MAX_ELEMENT[:NUMBER_OF_GPUS])

You’ll want to do this to all the places where you have similar issues.

Though after that when the starts to use multiple GPU, you’ll have an issue with “particles” since you have it copied only to the default device (i.e. device 0). You’ll need to copy it to all the devices in order to access it, decompose the problem, and then merge the results back into the host array.

Again, I don’t advise programming multi-gpus this way. It’s adds a lot of complexity and makes the code far less portable. Granted since you have CUDA API calls, you’re already not portable. Not that you can’t get this to work, it’s just tricky to get right. Using MPI is a much better way to go and will also allow for your code to scale across multiple nodes.


I have a query regarding “NVLink Signaling and Protocol Technology” and specifically Atomic operations via NVLink.

“The protocol uses a variable length packet with packet sizes ranging from 1 (simple read request command for example) to 18 (write request with data for 256B data transfer with address extension) flits” ( - Page 35)

From the above I presume NVLink uses the same protocol to perform an atomic-operation on a peer GPU. I could not find any whitepaper or references that describes the protocol or packets’ . Could you please confirm if my assumption is true? If not then could you please give info on how NVLink performs an atomic-operation?

Thank you in advance.

This is out of my area of expertise, but I asked a colleague who responded “The short answer is we use the same protocol for peer GPUs as we would to a host CPU connected with NVLink”.