Parallel construct reductions

Here’s an interesting one.
First, the code. It’s an excerpt from a larger program, so the line numbers in the output below don’t match the snippet. Line 460 corresponds to the data copyin line.
I’m using PGCC v12.8.

#pragma acc data copyin(a[0:n])
{

#pragma acc parallel loop reduction(+:z)
    for (i=0;i<n;i++){
        z += a[i];
    }
}

And the compilation output:

    460, Generating copyin(a[0:n])
    464, Accelerator kernel generated
        464, CC 1.0 : 7 registers; 48 shared, 32 constant, 0 local memory bytes
             CC 2.0 : 12 registers; 0 shared, 72 constant, 0 local memory bytes
        465, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
    464, Generating copyin(a[0:n])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary

So, some questions:

  1. Could the compiler output be changed so that it tells me it has actually done the reduction? If I omit the reduction clause, the compiler realises I wanted a reduction, inserts one for me AND tells me it has happened. I presume it’s working, as I get the following from the runtime diagnostic:
    464: region entered 10 times
        time(us): total=4,798 init=3 region=4,795
                  kernels=1,525
        w/o init: total=4,795 max=1,253 min=391 avg=479
        464: kernel launched 10 times
            grid: [4096]  block: [256]
            time(us): total=1,395 max=143 min=138 avg=139
        465: kernel launched 10 times
            grid: [1]  block: [256]
            time(us): total=130 max=13 min=13 avg=13

I presume this means that the kernel launched at 465 is really a reduction function/kernel inserted by the compiler.

  2. Can I disable the automatic reduction detection? Sometimes, when I want to show someone the reduction clause working, it’d be nice to be able to show it not working and how this might cause the wrong answer due to overwrites of z etc (or z being automatically privatized).

  3. Looking at the runtime output, I can see only one copyin is done:

    460: region entered 10 times
        time(us): total=13,067 init=3 region=13,064
                  data=6,335
        w/o init: total=13,064 max=3,933 min=1,009 avg=1,306

But this is confusing as the compiler says it does a copyin at 464 as well as at 460. Is this just a compiler output bug?

Cheers,
-Nick.

Hi Nick,

I presume this means that the kernel launched at 465 is really a reduction function/kernel inserted by the compiler.

That seems correct. Though, I think you’re right in that an informational message should be emitted when a reduction is generated from a reduction clause, not just when the compiler auto-detects one. I added TPR#18894 to track this.
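
For anyone wondering why a reduction would show up as two kernels, here is a rough host-side sketch of the usual two-stage pattern. It is only an illustration of the general technique, not the code the compiler actually generates. The names partial_sums, combine and two_stage_sum are made up for this example, and float is an assumed element type. Stage 1 plays the role of the many-block launch at 464, stage 2 the single 256-thread block at 465.

```c
#include <stddef.h>
#include <stdlib.h>

#define BLOCK 256  /* threads per block, as in "block: [256]" in the profile */

/* Stage 1 (stands in for the launch at 464): each block of BLOCK
   consecutive elements is summed into one partial result. */
static void partial_sums(const float *a, size_t n, float *partial, size_t nblocks)
{
    for (size_t b = 0; b < nblocks; b++) {
        float s = 0.0f;
        for (size_t t = 0; t < BLOCK; t++) {
            size_t i = b * BLOCK + t;
            if (i < n)
                s += a[i];
        }
        partial[b] = s;
    }
}

/* Stage 2 (stands in for the launch at 465, grid [1]): one block of
   BLOCK lanes strides over the partial results, then the lanes are summed. */
static float combine(const float *partial, size_t nblocks)
{
    float lane[BLOCK] = {0};
    for (size_t t = 0; t < BLOCK; t++)
        for (size_t b = t; b < nblocks; b += BLOCK)
            lane[t] += partial[b];

    float z = 0.0f;
    for (size_t t = 0; t < BLOCK; t++)
        z += lane[t];
    return z;
}

/* Hypothetical driver tying the two stages together. */
float two_stage_sum(const float *a, size_t n)
{
    if (n == 0)
        return 0.0f;
    size_t nblocks = (n + BLOCK - 1) / BLOCK;
    float *partial = malloc(nblocks * sizeof *partial);
    if (partial == NULL)
        return 0.0f;  /* sketch only: no real error reporting */
    partial_sums(a, n, partial, nblocks);
    float z = combine(partial, nblocks);
    free(partial);
    return z;
}
```

The second stage is tiny compared with the first, which would be consistent with the short times reported for the 465 launch.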

Can I disable the automatic reduction detection?

No. Without the reduction code, the loop is not parallel, so either no kernel or a sequential kernel would be generated.
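
If the aim is just to show people what goes wrong without a reduction, one option is to emulate the race on the host. The sketch below is an assumption-laden illustration: sum_with_reduction and sum_lost_updates are hypothetical names, float is an assumed type for z, and the second function models only one worst-case interleaving (every thread reads z before any thread writes it back), not what a real GPU run would necessarily produce.

```c
/* The loop from the original post, with the explicit reduction clause.
   Built without OpenACC support, the pragma is ignored and the loop
   simply runs sequentially on the host. */
float sum_with_reduction(const float *a, int n)
{
    float z = 0.0f;
    #pragma acc parallel loop reduction(+:z) copyin(a[0:n])
    for (int i = 0; i < n; i++)
        z += a[i];
    return z;
}

/* Host-only emulation of lost updates: every "thread" sees the same
   stale value of z, so each write overwrites the previous one and only
   the last contribution survives. */
float sum_lost_updates(const float *a, int n)
{
    float z = 0.0f;
    float stale = z;          /* value all threads read before any write */
    for (int i = 0; i < n; i++)
        z = stale + a[i];     /* overwrite instead of accumulate */
    return z;
}
```

For a = {1, 2, 3, 4}, the first function returns 10 and the emulation returns 4, which is the kind of overwrite Nick describes.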

But this is confusing as the compiler says it does a copyin at 464 as well as at 460. Is this just a compiler output bug?

The second one is actually a “present”. The compiler will do a runtime check to make sure the “a” in the kernel is the same as the “a” in the data copy. By doing this, we can support pointer swapping. Though, yes, the output is confusing and I have an open issue (TPR#18858) requesting that our engineers make this clearer.
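
As a rough sketch of the pointer-swapping case, assuming a hypothetical routine halve_sweeps that is not from Nick's code: both arrays are placed on the device once by the data region, the host pointers are swapped between sweeps, and the present clause on the compute construct looks up whichever array each pointer currently refers to.

```c
/* Hypothetical example: each sweep writes half of a[i] into b[i], then
   the two host pointers are swapped so the next sweep reads the newest
   data. Returns the pointer that holds the latest result. */
float *halve_sweeps(float *a, float *b, int n, int sweeps)
{
    #pragma acc data copy(a[0:n], b[0:n])
    {
        for (int s = 0; s < sweeps; s++) {
            /* present: no new transfer, just a runtime lookup of the
               arrays that a and b point to at this moment */
            #pragma acc parallel loop present(a[0:n], b[0:n])
            for (int i = 0; i < n; i++)
                b[i] = 0.5f * a[i];

            float *tmp = a;   /* swap on the host only */
            a = b;
            b = tmp;
        }
    }
    return a;
}
```

Starting from a = {8, 8} and running three sweeps, the returned pointer refers to {1, 1}.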

Best Regards,
Mat

Thanks Mat.

-Nick.

Nick,

We have added more information to -Minfo, including when reductions are performed in OpenACC code. See our 14.1 release.

thanks,
dave