Single CPU thread multi-GPU array reduction/sharing

I have written an application that extracts sightlines through hydrodynamics simulations, which is successfully accelerated on a single GPU using OpenACC. Here is some (pseudo)code showing the style of parallelism:

int N = 30000000;     /* size of simulation, e.g. SPH particles or grid cells */
double *simulation;   /* some hydro quantity like fluid density */
double *x_sim;        /* coordinates of simulation quantity */

... /* some further initialization here */

/* Transfer data onto GPU */
#pragma acc enter data copyin(simulation[0:N], x_sim[0:N])

int L = 10000;       /* number of sightlines to extract */
int G = 2048;        /* number of grid cells along sightline */ 
double *sightlines;  /* the quantity extracted along sightline */
double *x_los;       /* coordinates of sightline grid cells */

... /* some further initialization here */

/* Extract sightlines */
#pragma acc kernels present(simulation[0:N], x_sim[0:N]) copyin(x_los[0:L*G]) copy(sightlines[0:L*G])
#pragma acc loop independent
for (int i = 0; i < L; i++) {
  for (int j = 0; j < N; j++) {
    for (int k = 0; k < G; k++) {
      /* distance from sightline grid cell to simulation cell */
      double factor = fabs(x_los[i*G + k] - x_sim[j]);
    
      /* extracted quantity weighted by distance */
      sightlines[i*G + k] += simulation[j]/factor;

    } /* loop over a given sightline's grid cells */
  } /* loop over simulation */
} /* loop over sightlines */

#pragma acc exit data delete(simulation[0:N], x_sim[0:N])

Note that the GPU has a copy of the simulation, and I parallelize the calculation across the sightlines. The sightlines are independent, so each gang (thread?) of the GPU can handle one sightline. The sightlines array is shared across the threads in global GPU memory, but this is memory-safe because each thread only writes to its own (non-overlapping) part of the array.

However, I now want to run this application on really big simulations, and for many, many sightlines. This means the time to compute a single sightline goes up, and we also start to have many more sightlines than GPU threads; both factors slow the application down (although it is of course still faster than running in serial on one CPU!). An obvious next step is to increase the number of GPUs, again taking advantage of the independence of the sightlines: I can give each GPU a copy of the simulation array, and each can then handle a subset of the sightlines.

Here’s where I am stuck on the best approach. Is there an OpenACC way to give the whole sightlines array to each GPU, let each compute its own (independent) part, and then effectively reduce the copies back onto the host’s copy of the array? Doing this naively, I would expect each GPU’s transfer back to the host to overwrite whatever had previously been written back by the other GPUs.

If there isn’t a straightforward way to do this in OpenACC, then I guess I can manually create sub-arrays for each GPU and do the reduction on the host afterwards, along the lines of the sketch below.
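
For concreteness, here is an (untested) sketch of that manual approach, driving each device from the single CPU thread via the OpenACC runtime API; names like ngpus and chunk are just placeholders. Each GPU gets the full simulation but only its own slice of the sightline arrays, the kernels are launched asynchronously, and the non-overlapping slices are copied back one by one:

#include <openacc.h>

int ngpus = acc_get_num_devices(acc_device_nvidia);
int chunk = (L + ngpus - 1)/ngpus;   /* sightlines per GPU, rounded up */

for (int d = 0; d < ngpus; d++) {
  acc_set_device_num(d, acc_device_nvidia);
  int start = d*chunk;
  int len = (start + chunk <= L) ? chunk : L - start;

  /* full simulation on every device, but only this device's slice
     of the sightline arrays */
  #pragma acc enter data copyin(simulation[0:N], x_sim[0:N]) \
      copyin(x_los[start*G:len*G], sightlines[start*G:len*G])

  #pragma acc parallel loop independent async \
      present(simulation[0:N], x_sim[0:N], \
              x_los[start*G:len*G], sightlines[start*G:len*G])
  for (int i = start; i < start + len; i++) {
    ... /* extraction as in the single-GPU version */
  }
}

for (int d = 0; d < ngpus; d++) {
  acc_set_device_num(d, acc_device_nvidia);
  #pragma acc wait

  int start = d*chunk;
  int len = (start + chunk <= L) ? chunk : L - start;

  /* slices don't overlap, so copying them back one by one cannot
     clobber another GPU's results */
  #pragma acc exit data copyout(sightlines[start*G:len*G]) \
      delete(simulation[0:N], x_sim[0:N], x_los[start*G:len*G])
}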

Thanks in advance for any help!

What is the output when you compile with -Minfo=accel? I believe there may be some sections of your code that could be further parallelized. Only the outer i loop is currently parallelized; the innermost k loop can also be accelerated, and you can hoist the load of simulation[j] out of the k loop (the division itself has to stay inside, since factor depends on k), though the compiler may be doing some of that automatically for you.
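
For illustration, here's roughly the restructuring I have in mind, sketched on your pseudocode (assuming the k loop really is independent in your actual code; sim_j and x_j are just local temporaries):

#pragma acc kernels present(simulation[0:N], x_sim[0:N]) copyin(x_los[0:L*G]) copy(sightlines[0:L*G])
#pragma acc loop independent gang
for (int i = 0; i < L; i++) {
  #pragma acc loop seq
  for (int j = 0; j < N; j++) {
    double sim_j = simulation[j];   /* hoisted: read once per j, not once per k */
    double x_j = x_sim[j];
    #pragma acc loop independent vector
    for (int k = 0; k < G; k++) {
      double factor = fabs(x_los[i*G + k] - x_j);
      sightlines[i*G + k] += sim_j/factor;
    }
  }
}

Each gang still owns one sightline, but the grid cells of that sightline are now spread across the vector lanes, and each lane accumulates into its own element of sightlines, so there is still no write conflict.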

Regarding multi-GPUs, using MPI will probably be your best bet. There is currently no way to give every GPU the whole array and have OpenACC merge the results back for you; you will need to decompose the domain and transfer each portion to its own GPU. However, on the transfer back to the host, you can have MPI do a reduction across the ranks.
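
A rough sketch of the structure (untested; chunk, start, and len are placeholders, and it assumes one rank per GPU, with each rank's sightlines buffer zeroed outside its own slice so that a sum-reduction simply merges the non-overlapping pieces):

#include <stdlib.h>
#include <mpi.h>
#include <openacc.h>

int rank, nranks;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nranks);

/* bind this rank to one GPU */
acc_set_device_num(rank % acc_get_num_devices(acc_device_nvidia),
                   acc_device_nvidia);

/* each rank takes a contiguous block of sightlines */
int chunk = (L + nranks - 1)/nranks;
int start = rank*chunk;
int len = (start + chunk <= L) ? chunk : L - start;

... /* run the extraction kernel for sightlines [start, start+len) */

/* merge the per-rank results onto rank 0; slices don't overlap, so
   summing the zero-padded buffers is equivalent to a gather */
double *result = (rank == 0) ? malloc((size_t)L*G*sizeof(double)) : NULL;
MPI_Reduce(sightlines, result, L*G, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

MPI_Finalize();

MPI_Gatherv on the contiguous slices would avoid shipping the zero padding, but the reduction is the simplest thing to drop in.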

Thanks for your response!

What is the output when you compile with -Minfo=accel? I believe there may be some sections of your code that could be further parallelized.

So you’re probably right that some extra parallelization is possible here. The compiler doesn’t do it automatically, I think because of the various (non-restrict-qualified) pointers involved in the real code, so the -Minfo=accel output looks like:

        168, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
        175, #pragma acc loop seq
        189, #pragma acc loop seq

The middle loop is not independent, but the final loop over sightline grid cells is independent and could be parallelized. I will try adding a

#pragma acc loop independent

to see if that can be parallelized more efficiently.

Regarding multi-GPUs, using MPI will probably be your best bet.

Fair enough; if there’s currently no way to do it with OpenACC alone, then MPI (or perhaps OpenMP) is the easiest way to go. I’ll give that a try. Thanks!