Hello. I have a reproducing example code here: https://github.com/AndStorm/QUESTION.git .
It is straightforward: at the beginning, N particles are created in the array particles by the make_particle() function; then propagator() changes the particle fields x, ir and rs, and compressor() sorts the alive particles and removes the dead ones, so that they end up in descending order by ir.
The code works properly on 1 GPU (GeForce 650 Ti) using the compile line:
cmake . -DCMAKE_C_COMPILER=pgcc -DCMAKE_CXX_COMPILER=pgc++ -DCMAKE_CXX_FLAGS="-acc -Minfo=all -mcmodel=medium -ta=tesla:cc30 -Mcuda=cuda10.1"
But new nodes will be installed on my company's cluster: 1 Intel CPU + 2 NVIDIA Tesla V100 SXM2 32 GB GPUs, connected via NVLink. I have to make this reproducing example work effectively in multi-GPU mode on these nodes (driven from 1 CPU, on both Tesla V100 GPUs simultaneously), if that is possible.
I have some questions; could you please answer them?
The array particles is created on the GPU on line 260. If there are 2 GPUs, should each of them hold its own copy of this array, or do NVLink, GPUDirect with peer-to-peer transfers (https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf), or CUDA Unified Memory allow some memory to be shared between the 2 GPUs?
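For what it's worth, here is how I currently picture the "private copy per GPU" option. This is only a sketch, not code from my repository: the particle struct and create_on_all_gpus() are simplified stand-ins I made up for this question, and the OpenACC parts are guarded so the file also compiles as plain C (host fallback).

```c
#include <assert.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Simplified stand-in for the real particle struct in my code. */
typedef struct { double x; int ir, rs; } particle;

/* Give every NVIDIA GPU its own device copy of p[0:n]. Returns the number
   of copies created (1 when compiled without OpenACC: host fallback). */
int create_on_all_gpus(particle *p, int n) {
#ifdef _OPENACC
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    if (ngpus < 1) ngpus = 1;
    for (int g = 0; g < ngpus; ++g) {
        acc_set_device_num(g, acc_device_nvidia);
        /* Each device gets an independent copy: as far as I understand,
           OpenACC provides no automatic coherence between these copies. */
        #pragma acc enter data copyin(p[0:n])
    }
    return ngpus;
#else
    (void)p; (void)n;
    return 1;
#endif
}
```

My understanding is that with plain `enter data` per device the copies are completely independent, which is exactly why I am asking whether NVLink/Unified Memory could spare me the explicit mirroring.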
When I call the make_particle() function, I can write
acc_set_device_num(gpuid, acc_device_nvidia);
#pragma acc parallel num_gangs(1) vector_length(1) present(particles[0:GL])
where gpuid is 0 or 1, so the OpenACC compute construct will be executed on the GPU with that id. But if gpuid = 0, should I update the data on the GPU with gpuid = 1 after this code has run, and if so, how do I update it?
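This is what I imagine the update step would look like. Again only a sketch under my assumptions: the array is presumed to already exist on both devices (e.g. via enter data), mirror_particles() is a hypothetical helper I invented, and without OpenACC it degenerates to a no-op because host and "device" memory coincide.

```c
#include <assert.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Simplified stand-in for the real particle struct in my code. */
typedef struct { double x; int ir, rs; } particle;

/* Copy p[0:n] from device src to device dst, staging through host memory:
   "update self" moves device -> host, "update device" moves host -> device.
   Compiled without OpenACC this does nothing (data is already on the host). */
void mirror_particles(particle *p, int n, int src, int dst) {
#ifdef _OPENACC
    acc_set_device_num(src, acc_device_nvidia);
    #pragma acc update self(p[0:n])
    acc_set_device_num(dst, acc_device_nvidia);
    #pragma acc update device(p[0:n])
#else
    (void)p; (void)n; (void)src; (void)dst;
#endif
}
```

Staging through the host looks wasteful to me on NVLink hardware, which is why I wonder whether a direct device-to-device path exists.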
Finally, the function compressor() on lines 127-252 is the most compute-intensive function in the whole program, and I don't understand how to parallelize it across 2 GPUs. I could probably divide all the arrays into 2 parts (one for the first GPU, another for the second) and try to sort the 1st half of the array particles on GPU0 and the 2nd half on GPU1. But if I, for example, sorted the 1st half on GPU0, should I then update that part of the array particles on GPU1?
Could you please tell me whether you see any way to offload the compressor() function to 2 GPUs simultaneously?
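Here is how I would try to code the halving idea, purely as a sketch: chunk_bounds() and compress_on_two_gpus() are names I made up, and I am assuming compressor() could be made to work on an arbitrary sub-range, which may well be wrong since its compaction is global.

```c
#include <assert.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Half-open bounds [*beg, *end) of chunk g out of ngpu chunks over n items;
   the remainder is spread over the first chunks, so sizes differ by at most 1. */
void chunk_bounds(int g, int ngpu, int n, int *beg, int *end) {
    int base = n / ngpu, rem = n % ngpu;
    *beg = g * base + (g < rem ? g : rem);
    *end = *beg + base + (g < rem ? 1 : 0);
}

/* Hypothetical driver: each iteration (or, better, each OpenMP host thread)
   would select its GPU and run compressor() on its own sub-range only. */
void compress_on_two_gpus(int n) {
    for (int g = 0; g < 2; ++g) {
        int beg, end;
        chunk_bounds(g, 2, n, &beg, &end);
#ifdef _OPENACC
        acc_set_device_num(g, acc_device_nvidia);
#endif
        /* ... run compressor() restricted to particles[beg:end-beg] ... */
        (void)beg; (void)end;
    }
}
```

Even if this works, I suspect the two locally sorted/compacted halves would still have to be merged (or at least exchanged) afterwards, since the ordering by ir is a property of the whole array, not of each half.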
I have searched the Internet for multi-GPU usage of OpenACC, but all I have found is a very simple Mandelbrot example. My code is more complicated, and I don't know what to do.