Incorrect Results with triply nested loop

I have the following triply nested loop that I am trying to parallelize:

#pragma acc kernels loop independent collapse(3) pcopyout(zv[:szz]) pcopyin(lv[:szl],rv[:szr]) 
for (int i = 0; i < zc; i++) {
  for (int j = 0; j < rc; i++) {
    for (int k = 0; k < lc; i++) {

I am confident that the sizes szz, szr, szl are correct, and I am basically positive that this loop should be independent. The compiler cannot automatically determine the sizes, so the copyin and copyout clauses are required, but I have tried doing this with and without the independent and collapse() clauses to no avail.

I sometimes get a compiler warning that there is a loop-carried dependency on zv->. I’m not sure what the problem is at this point. I am quite certain that zv, lv, and rv are independent, so this loop should be independent as well. For whatever reason, the compiler’s generated code does not produce the result I expect. What am I doing wrong here?

Hi Aaron,

There’s a typo in your code where you have “i++” in each of the three loops. Once I fix this, I get the expected feedback messages.

% cat test.c
int foo (float *zv, float *lv, float *rv, int szr, int szz, int szl, int rc, int zc, int lc) {

#pragma acc kernels loop independent collapse(3) pcopyout(zv[:szz]) pcopyin(lv[:szl],rv[:szr])
for (int i = 0; i < zc; i++) {
  for (int j = 0; j < rc; j++) {
    for (int k = 0; k < lc; k++) {
      /* loop body omitted in the post */
    }
  }
}
return 1;
}
% pgcc -acc -c -Minfo=accel test.c -V15.10
      3, Generating copyout(zv[:szz])
         Generating copyin(lv[:szl],rv[:szr])
      4, Loop is parallelizable
      5, Loop is parallelizable
      6, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
          4, #pragma acc loop gang, vector(128) collapse(3) /* blockIdx.x threadIdx.x */
          5,   /* blockIdx.x threadIdx.x collapsed */
          6,   /* blockIdx.x threadIdx.x collapsed */

Hope this helps,

Thanks for the feedback! I’m sorry, that typo was a transcription error. I use a macro for looping and was expanding out the macro for this forum by hand, which led to some problems. The code that is actually running does not have that typo in it.

Without “independent”, the dependency message is correct. The issue is that you’re using computed indices, so the compiler must assume that different iterations could compute the same index. “Independent” asserts to the compiler that the loops are independent.

As the example shows, though, once you add “independent” the code parallelizes. If you didn’t put “independent” on every loop (or collapse them), you would still get warnings for the loops without it.
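
To make the two options concrete, here is a minimal sketch of both forms. The loop body, the function names fill_collapsed and fill_per_loop, and the array sizes (lc elements in lv, rc in rv, zc*rc*lc in zv) are assumptions for illustration only, since the real loop body was never posted. Each (i, j, k) writes a distinct element of zv through a computed index, which is the situation where the compiler cannot prove independence by itself.

```c
#include <assert.h>

/* Hypothetical loop body: the thread never shows the real one.
 * Every (i, j, k) maps to a distinct zv element, so the iterations
 * really are independent, but the compiler cannot prove that from
 * the computed index alone. */

/* Variant 1: collapse(3) with independent on the combined directive. */
int fill_collapsed(float *zv, const float *lv, const float *rv,
                   int zc, int rc, int lc)
{
    const int szz = zc * rc * lc;   /* assumed number of elements in zv */
    assert(zv && lv && rv);

#pragma acc kernels loop independent collapse(3) pcopyout(zv[:szz]) pcopyin(lv[:lc], rv[:rc])
    for (int i = 0; i < zc; i++) {
        for (int j = 0; j < rc; j++) {
            for (int k = 0; k < lc; k++) {
                zv[(i * rc + j) * lc + k] = lv[k] * rv[j] + (float)i;
            }
        }
    }
    return szz;                     /* number of elements written */
}

/* Variant 2: no collapse, so each loop gets its own independent. */
int fill_per_loop(float *zv, const float *lv, const float *rv,
                  int zc, int rc, int lc)
{
    const int szz = zc * rc * lc;   /* assumed number of elements in zv */
    assert(zv && lv && rv);

#pragma acc kernels pcopyout(zv[:szz]) pcopyin(lv[:lc], rv[:rc])
    {
#pragma acc loop independent
        for (int i = 0; i < zc; i++) {
#pragma acc loop independent
            for (int j = 0; j < rc; j++) {
#pragma acc loop independent
                for (int k = 0; k < lc; k++) {
                    zv[(i * rc + j) * lc + k] = lv[k] * rv[j] + (float)i;
                }
            }
        }
    }
    return szz;                     /* number of elements written */
}
```

Built with OpenACC enabled (for example with the pgcc -acc -Minfo=accel command shown earlier), both variants should be reported as parallelizable. Without OpenACC support the pragmas are ignored and the loops simply run serially, which is a handy way to get reference values to compare against. Keep in mind that “independent” is a promise from you to the compiler: if two iterations really did write the same zv element, the results would be wrong and no warning would be issued.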

Hope this helps clarify what’s going on. If not, please post a reproducing example and the messages you’re seeing, and we can take a look.

  • Mat

Ah, thanks for the clarification. I understand why I am getting those messages now. With some more debugging that I’ve been doing, it seems that there’s something weird with the way that I’ve been doing memory management or the like that is leading to the wrong values showing up, so it’s not a problem with this loop directly, but rather with the memory management surrounding the loop. I’ve posted a different thread trying to understand how that works.