Wrong results: 12.5 vs 12.6

Hello,

I am currently working with the PGI compiler 12.6 and have run into some unexpected problems.

With 12.5 all results are correct. If I compile the exact same source code with the PGI 12.6 compiler (no warnings!), both the runtime and the results differ: the results are wrong and the runtime is totally different.

I realized that the 12.6 compiler chooses to distribute the work in a different way. However, the results should still be correct, right?

Have you encountered similar problems? What could be the reason for this behaviour?

Best,
Paul

Hi PaulPa,

First, you could inspect the GPU code the compiler produces by adding the keepgpu sub-option:

-ta=nvidia,keepgpu

If the compiler really does produce incorrect code, file a bug report and send it to the PGI team. Hopefully they can suggest a workaround for now, or you can try to find your own.

Alexey

Hi Paul,

It could be a bug in either the compiler or your code. Can you please post an example?
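A small self-checking reproducer is usually the quickest way to tell the two cases apart. The following is only a sketch, not your code: the saxpy kernel, the function names `saxpy_acc`, `saxpy_host` and `count_mismatches`, and the tolerance argument are all hypothetical stand-ins. The idea is to run the accelerated loop and a plain serial loop on the same input and count the elements that disagree.

```c
#include <assert.h>
#include <stddef.h>

/* Accelerated version (hypothetical stand-in kernel): y = alpha * x + y.
 * Without OpenACC support the pragma is ignored and the loop runs serially. */
void saxpy_acc(int n, float alpha, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}

/* Serial reference with the same arithmetic, never offloaded. */
void saxpy_host(int n, float alpha, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);
    for (int i = 0; i < n; ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}

/* Count elements whose absolute difference exceeds tol.
 * NaN differences compare false and are therefore not counted. */
int count_mismatches(const float *a, const float *b, int n, float tol)
{
    int bad = 0;
    for (int i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        if (d < 0.0f) {
            d = -d;
        }
        if (d > tol) {
            ++bad;
        }
    }
    return bad;
}
```

If you build the same file with both compiler versions and only one of them reports mismatches, that points towards the compiler rather than the source.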

Thanks,
Mat

Thanks Mat!

Are you going to implement it in the next release?

P.S. See the thread "what about host_data directive in PGI 12.6" (Legacy PGI Compilers, NVIDIA Developer Forums).

Alexey

Are you going to implement it in the next release?

I think they are looking towards the 12.9/12.10 time frame. We’re trying to get back on schedule with a 12.8 release next week and I don’t think they’ll have this done in time.

  • Mat

Hi guys,

sorry for the late response.

Mat, would it be possible to send you the source code via mail?
As I said, it is exactly the same source code and I can't find any errors in it.

Best,
Paul

Hi Paul,

Mat, would it be possible to send you the source code via mail?

Yes. Please send it to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me. If it’s a compiler bug, I’ll triage it, submit a report, and hopefully find a workaround.

  • Mat

Hi guys,

I just wanted to let you know that the exact same source code works again with
PGI compiler 12.9. (12.8 did not work either.)

I realized that 12.9 schedules the work as follows:

    121, Loop is parallelizable
         Accelerator kernel generated
        121, #pragma acc loop gang /* blockIdx.x */
             CC 2.0 : 27 registers; 32 shared, 136 constant, 0 local memory bytes
        131, #pragma acc loop vector(256) /* threadIdx.x */

while 12.8 does the following:

        121, #pragma acc loop gang /* blockIdx.x threadIdx.x */
             Cached references to size [(x)] block of 'bhat'
             CC 2.0 : 27 registers; 32 shared, 136 constant, 0 local memory bytes
        131, #pragma acc loop vector(32) /* threadIdx.y */

These two look very similar, but 12.8 also reports threadIdx.x for the loop at line 121. This is strange, since the feedback does not mention a vector schedule for that loop, whereas 12.9 reports only blockIdx.x there.

What does the /* … */ part stand for anyway?

@Mat: This is the same code I filed a bug report for earlier.

Thank you.

Best,
Paul

Hi Paul,

Yes, TPR#18913 was listed as fixed in 12.9.

What does the /* … */ part stand for anyway?

It’s informational: it shows the correspondence between the OpenACC schedule and the target device schedule. For NVIDIA CUDA, a “gang” corresponds to a “block” and a “vector” to a “thread”. The “.x”, “.y”, and “.z” suffixes are the dimensions of the CUDA grid or thread block.
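As an illustration only (this is not Paul's code; the function `row_sums`, its arguments and the vector length of 256 are assumptions chosen to mirror the 12.9 feedback above), a nested loop with an explicit gang/vector schedule might look like the sketch below. For a schedule like this I would expect the feedback to show blockIdx.x on the outer loop and threadIdx.x on the inner one.

```c
#include <assert.h>
#include <stddef.h>

/* Sum each row of an n x m row-major matrix `a` into `out`.
 * Hypothetical example. Without OpenACC support the pragmas are
 * ignored and the loops simply run serially on the host. */
void row_sums(const float *a, float *out, int n, int m)
{
    assert(a != NULL && out != NULL && n >= 0 && m >= 0);

    /* Outer loop: one gang per row, i.e. one CUDA block (blockIdx.x). */
    #pragma acc parallel loop gang vector_length(256) \
        copyin(a[0:n*m]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;

        /* Inner loop: vector lanes, i.e. CUDA threads (threadIdx.x),
         * combined with a sum reduction. */
        #pragma acc loop vector reduction(+:sum)
        for (int j = 0; j < m; ++j) {
            sum += a[i * m + j];
        }
        out[i] = sum;
    }
}
```

Compiling with -Minfo=accel prints the schedule the compiler actually chose, so you can compare it against the one you asked for.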

  • Mat

Hi Mat,

so why does it say:

#pragma acc loop gang /* blockIdx.x threadIdx.x */

I figured that this should be blockIdx only.

Best,
Paul

I figured that this should be blockIdx only.

That was the bug fixed in TPR#18913; it should have been blockIdx only.

  • Mat