best approach for reduction problem

Hi,

I am looking for a solution to a reduction problem that is a little bit more complex than the standard max reduction problem found in various OpenACC examples/docs. I am assuming that reduction variables must be of basic type (int,double etc.) and not struct/class type, otherwise the solution would be trivial.

I need to find the maximum and its location which is an integer tuple:

  double maxfield = 0.;
  int i_max = -1;
  int j_max = -1;
  int k_max = -1;
  for(int i = 0; i < xmax; ++i) {
    for(int j = 0; j < ymax; ++j) {
      for(int k = zmin; k < zmax; ++k) {
        double v = sqrt(F1.Ex[ExID(i, j, k)]*
		        F1.Ex[ExID(i, j, k)] +
		        F2_.Ex[ExID(i, j, k)]*
		        F2_.Ex[ExID(i, j, k)]);
        if(v > maxfield){
	    maxfield = v;
	    i_max = i;
	    j_max = j;
	    k_max = k;
	  }
      }
    }
  }

ExID is just a macro for mapping the tuple (i,j,k) into the one-dimensional array.

If there were no need for the location of the maximum I could simply write the following statement withing a kernels construct:

#pragma acc loop independent collapse(3) reduction(max:maxfield)
for(.....)

I was thinking about thrust::max_element, but then I am not dealing with a single array, but the maximum of the sum of two arrays, which must not be changed. Creating a temporary array for holding the sum will cost me valuable device memory. It also would not directly solve finding the tuple as it returns a linear index, but this wouldn’t be a showstopper as it is a unique mapping and its inverse can be computed.

Do you have any best practices for handling such a reduction problem in an OpenACC context?

Thanks,
LS

Hi LS,

The Max+Index reduction issue has come up before but there’s no easy solution. Given the difficulty, I don’t foresee it becoming part of the OpenACC standard. OpenMP doesn’t have it either. Maybe it could be done if user defined reductions were added to OpenACC? I’ll bring the topic up with our rep on the standard committee.

How I’ve seen this solved is by making two passes. Once to find the max value and a second to find the index. Not ideal, but should work. Something like:

#pragma acc parallel loop collapse(3) reduction(max:maxfield) present(arr)
   for(int i = 0; i < xmax; ++i) {
     for(int j = 0; j < ymax; ++j) {
       for(int k = 0; k < zmax; ++k) {
         double v = arr[i][j][k];
         if(v > maxfield){
            maxfield = v;
         }
    }}}

#pragma acc parallel loop collapse(3) present(arr) copy(i_max,j_max,k_max)
   for(int i = 0; i < xmax; ++i) {
     for(int j = 0; j < ymax; ++j) {
       for(int k = 0; k < zmax; ++k) {
         double v = arr[i][j][k];
         if(v == maxfield){
#pragma acc atomic write
            i_max = i;
#pragma acc atomic write
            j_max = j;
#pragma acc atomic write
            k_max = k;
         }
     }}}

Hi Mat,

Thanks for the info. If there are not too many index tuples with the same maxfield value the performance probably should not suffer too much from the atomic statements, right? Still, two passes are needed, but certainly more efficient that copying the data to the host for the computation.

Thanks,
Lutz

If there are not too many index tuples with the same maxfield value the performance probably should not suffer too much from the atomic statements, right?

The assumption would be that there’s only max so maybe the atomic doesn’t matter and could be gotten rid of?

Though if there are multiple maxfields, I do see a flaw in the solution I posted. While a small risk, there is a possible race condition since, while each individual max index is protected, you really need a critical section around all three index variables. Otherwise you could have the max “i” from one case of “maxfield” and the “j” from another.

Do you have an idea on how many elements of the tuple would match maxfields? If there’s a lot, then I’m thinking that you might need to move to using a spin lock rather than atomics. You’d need to do this in CUDA.

Alternatively since “Ex” is really a single dimension, you could flatten i, j, and k into a single loop and thus only need to search for single index. Of course, “zmin” is the problem with this idea so you would need to add logic to skip these elements.

  • Mat

Hi Mat,

thanks for your help. I managed to get it to work with the two-step approach (the first for finding the max and the second to find the location) by computing the linear index from the (i,j,k) tuple and using atomic write for that linear index. After the second loop nest I simply compute the index tuple (i,j,k) from the linear index and the code proceeds as normal.

I reckoned this would be the best approach given that I might have many locations with the same maxfield.

User-defined reductions would be nice and in OpenMP 4.x this feature is already available and supported by GCC >= 4.9 or so.

Cheers,
LS