For the first post, this is a compiler problem, so I've added TPR#21733. Because a “kernels” (aka “region”) construct looks for parallelism, the compiler will try to offload the for loop but evaluate the if statement on the host. Since “d” is on the device, I think the compiler is getting confused here. You can work around it by not moving the scalar “d” to the device (i.e. remove it from the data directives).
Note that it’s typically better not to manage scalars. By putting a scalar in a data region you’ve made it global and shared by all threads. Often scalars should be firstprivate, which is the default, and only in rare cases should they be shared.
If you want to keep “d” in a data region, this is a case where the “parallel” construct would be better. With “parallel”, you are telling the compiler where to offload to the device, so it will move the if statement to the device. However, by doing this you’ll severely limit the parallelism since you’ll only have one gang and only the inner loop scheduled as a vector. It would be better to move the compute construct (kernels or parallel) around the “for” statement and not around the if statement.
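As a rough sketch of that last suggestion (not your actual code): the function name, array, and "enabled" flag below are made up for illustration. The if statement stays on the host, the compute construct wraps only the for loop, and “d” is left out of any data directive so it is passed in as a firstprivate scalar.

```c
/* Hypothetical helper, for illustration only: scales a[] by d on the
 * device when `enabled` is non-zero. Names are assumptions, not taken
 * from the original post. */
void scale_if_enabled(float *a, int n, float d, int enabled)
{
    if (enabled) {   /* the condition is evaluated on the host */
        /* The compute construct wraps only the loop. The scalar d is not
         * in any data clause, so it defaults to firstprivate here. */
        #pragma acc parallel loop copy(a[0:n])
        for (int i = 0; i < n; ++i) {
            a[i] *= d;
        }
    }
}
```

Building with "pgcc -acc -Minfo=accel" should show whether the loop was offloaded. Without -acc the pragma is ignored and the loop simply runs on the host.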
For the second issue, you’re not actually creating valid device code. Here are the compiler feedback messages:
% pgcc 06_16_15c.c -acc -Minfo=accel
5, Generating enter data create(d)
8, Generating update device(d)
13, Accelerator restriction: scalar variable live-out from loop: d
Accelerator scalar kernel generated
14, Accelerator restriction: scalar variable live-out from loop: d
15, Accelerator restriction: induction variable live-out from loop: d
19, Generating update host(d)
20, Generating exit data delete(d)
The problem is that since “d” gets updated by all threads, it can lead to a race condition where some threads may get an older copy of “d” before other threads have updated it. This is a case where you need to put “d” in a reduction clause so the correct device code is created.
Here’s an example. I also took the liberty of removing the data management of “d”.
% cat 06_16_15d.c
#pragma acc kernels loop reduction(+:d)
% pgcc 06_16_15d.c -acc -Minfo=accel
8, Loop is parallelizable
9, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
8, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
9, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
Sum reduction generated for d
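For anyone reading along without the file, here is a minimal stand-in showing the general shape of a sum reduction over a nested loop. This is not the contents of 06_16_15d.c: the function name, loop bounds, and loop body are assumptions chosen just to make the sketch self-contained.

```c
/* Hypothetical stand-in, NOT the original 06_16_15d.c. Sums the linear
 * indices of a rows x cols grid into the scalar d. */
int sum_grid(int rows, int cols)
{
    int d = 0;   /* plain host scalar, not placed in a data region */

    /* The reduction clause gives each thread its own partial sum and
     * combines them at the end, instead of racing on a shared d. */
    #pragma acc kernels loop reduction(+:d)
    for (int i = 0; i < rows; ++i) {
        #pragma acc loop reduction(+:d)
        for (int j = 0; j < cols; ++j) {
            d += i * cols + j;
        }
    }
    return d;
}
```

For example, sum_grid(3, 4) adds the integers 0 through 11, so it should return 66 whether or not the loop is offloaded.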
Hope this helps,