Reduction loop generates CUDA memcpy

Hi,

I have a simple reduction loop:

      #pragma acc parallel loop present (partial_results) reduction(+:scat_int_calc)
        for (int iter = 0; iter < nadditions; ++iter)  {
          scat_int_calc += partial_results[iter];
        }

When I run the code under nvprof, it shows that this loop generates a CUDA memcpy DtoH and HtoD.

Here is part of the nvprof log. When I comment out the reduction loop, the memcpy goes away.

25.63% 3.02789s 494640 6.1210us 5.8870us 8.6720us _Z19calculate_intensityiiPA5_dPiPd_230_gpu_red
22.60% 2.67036s 494640 5.3980us 5.1510us 8.1920us _Z19calculate_intensityiiPA5_dPiPd_230_gpu
9.11% 1.07616s 494640 2.1750us 2.0470us 16.608us [CUDA memcpy DtoH]

I wonder why this is needed. The problem is that the API overhead of the memcpy is 18us, which is very expensive.

Thanks,
George.

Hi George,

The reduction variable needs to be copied back from the device, which is why you’re seeing the memcpy.

If you don’t need the reduction variable on the host, you can manage it via a data region so it stays on the device:

#pragma acc data create(scat_int_calc)
{
 ...
      #pragma acc parallel loop present (partial_results) reduction(+:scat_int_calc) 
        for (int iter = 0; iter < nadditions; ++iter)  { 
          scat_int_calc += partial_results[iter]; 
        }
 ...
      #pragma acc parallel ....
    {
           ... use scat_int_calc on the device ...
    }
  ...
}
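
In case it helps, here's a more complete, untested sketch of the same idea with the elided parts filled in by example code of my own (the sizes and the second kernel are just placeholders for whatever actually consumes scat_int_calc on the device). I've used copyin instead of create so the initial zero comes from the host:

    // Untested sketch: scat_int_calc is initialized on the host, copied to the
    // device once via copyin, and never copied back.
    const int nadditions = 5000;                       // example value
    double *partial_results = new double[nadditions];
    double *output          = new double[nadditions];
    for (int i = 0; i < nadditions; ++i) partial_results[i] = 1.0;

    double scat_int_calc = 0.0;

    #pragma acc data copyin(scat_int_calc, partial_results[0:nadditions]) \
                     copyout(output[0:nadditions])
    {
        #pragma acc parallel loop present(partial_results) reduction(+:scat_int_calc)
        for (int iter = 0; iter < nadditions; ++iter) {
            scat_int_calc += partial_results[iter];
        }

        // Use the reduced value on the device without copying it back to the host.
        #pragma acc parallel loop present(partial_results, scat_int_calc)
        for (int iter = 0; iter < nadditions; ++iter) {
            output[iter] = partial_results[iter] * scat_int_calc;
        }
    }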

Hope this helps,
Mat

Thanks Mat.

I see your point. Since the overhead of the CUDA memcpy API is high, I can group multiple reductions into one kernel call. Would something like the following code snippet be possible in OpenACC?

double *restrict reduced_results = new double[60];

#pragma acc parallel loop copyout(reduced_results[0:60]) present(partial_results)
for (int i = 0; i < 60; i++) {
    double scat_int_calc = 0;

    #pragma acc loop reduction(+:scat_int_calc)
    for (int j = 0; j < nadditions; j++) {
        scat_int_calc += partial_results[i * nadditions + j];
    }
    reduced_results[i] = scat_int_calc;
}

Best regards,
George.

Hi George,

Would something like the following code snippet be possible in OpenACC?

Yes, this is a typical case where you have an inner “vector” loop reduction inside a “gang” loop.

Note that there is some overhead involved in performing a reduction, so if “nadditions” is relatively small, it may be better to run the inner loop sequentially.
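
For example, with the loop schedule spelled out explicitly (just a sketch; the compiler will usually pick a similar schedule on its own, and the declarations are as in your snippet):

    #pragma acc parallel loop gang copyout(reduced_results[0:60]) present(partial_results)
    for (int i = 0; i < 60; i++) {
        double scat_int_calc = 0;

        // Vector-level reduction within each gang. If nadditions were small,
        // replacing this directive with "#pragma acc loop seq" would skip the
        // reduction machinery and just run the inner loop sequentially.
        #pragma acc loop vector reduction(+:scat_int_calc)
        for (int j = 0; j < nadditions; j++) {
            scat_int_calc += partial_results[i * nadditions + j];
        }
        reduced_results[i] = scat_int_calc;
    }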

  • Mat

Thanks Mat. That worked. nadditions is actually large, ~5000.

I observed that the reduction kernel is implemented within one gang, which I guess is a thread block, so there is no need for kernel synchronization as in a two-step reduction (within thread blocks and then between blocks). However, my code can now only scale to 61 iterations in the outer loop. This is fine for a K20 GPU with 14 multiprocessors, but as the number of multiprocessors increases, I won’t be able to take advantage of that.

So is there a way in OpenACC to combine the two loops into a single loop then divide them between gangs?

So is there a way in OpenACC to combine the two loops into a single loop then divide them between gangs?

I can’t think of a way to do this, since you can’t have several gangs compute partial sums and then do the final reduction back in a single gang within the same kernel (there is no synchronization across gangs).

One experiment you might try is to only parallelize the inner loop:

for (int i = 0; i < 60; i++) {
    double scat_int_calc = 0;

    #pragma acc parallel loop reduction(+:scat_int_calc) present(partial_results)
    for (int j = 0; j < nadditions; j++) {
        scat_int_calc += partial_results[i * nadditions + j];
    }
    reduced_results[i] = scat_int_calc;
}

You’ll add extra overhead due to the increased number of kernel launches but save the copyout of reduced_results and help increase the amount of parallelism. Granted 5000 really isn’t that large so it still may be better to keep the original schedule, but it’s an easy experiment to try.

  • Mat

Actually, I started with your proposed method and found that the kernel launch and memcpy overheads are large enough to wipe out any improvement in performance.

I agree that 5000 is not that large, but given that the number of threads per block is limited to 1024, I would think it might be better to create 5 thread blocks of 1024 threads each rather than using an inner loop within one thread block.
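
For what it's worth, the only way I can see to express that in OpenACC is the separate-kernel form again, with scheduling hints, something like the untested sketch below (num_gangs and vector_length are only hints that the compiler may override):

    for (int i = 0; i < 60; i++) {
        double scat_int_calc = 0;

        // Hint at roughly 5 blocks of 1024 threads per reduction; the runtime
        // still combines the per-gang partial sums. The kernel-launch overhead
        // per iteration of i remains, which is what hurt this variant for me.
        #pragma acc parallel loop reduction(+:scat_int_calc) present(partial_results) \
                    num_gangs(5) vector_length(1024)
        for (int j = 0; j < nadditions; j++) {
            scat_int_calc += partial_results[i * nadditions + j];
        }
        reduced_results[i] = scat_int_calc;
    }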

Thanks again for your support.

Best regards,
George.