Error with #define statement for accessing array

I seem to have run into a possible bug when using a #define statement to map 2D indexing onto a 1D array with OpenACC, e.g.:

#define pres_red(I, J) pres_red[((I) * ((NUM_2) + 2)) + (J)]

Using this in a number of places in my code, I found that certain locations (but, strangely, not all) in device memory were not being accessed properly. The same code without OpenACC (CPU only) works fine, and replacing the macro calls with the explicit indexing expression appears to fix it.

Is this a known issue?

Hi Kyle,

What’s the error message from “-Minfo=accel”? My guess is that, because of the computed index, the compiler can’t prove that the loop iterations all update different elements of the array. In these cases, you need to add the “independent” clause to the “loop” directive to tell the compiler the iterations are independent and can therefore be parallelized.

  • Mat

Hi Mat,

Unfortunately, I’m not getting an error message from the compiler — it compiles and runs without any apparent issue. The error is in the results.

I actually already have the “loop independent” clause; the function where I’m noticing the error looks like this:

#pragma acc kernels present(F[0:SIZE], G[0:SIZE], pres_red[0:SIZEP], pres_black[0:SIZEP])
#pragma acc loop independent
for (col = 1; col < NUM + 1; ++col) {
  #pragma acc loop independent
  for (row = 1; row < (NUM / 2) + 1; ++row) {

    int NUM_2 = NUM >> 1;

    Real p_ij = pres_black(col, row);

    Real p_im1j = pres_red(col - 1, row);
    Real p_ip1j = pres_red(col + 1, row);
    Real p_ijm1 = pres_red(col, row - ((col + 1) & 1));
    Real p_ijp1 = pres_red(col, row + (col & 1));

    // right-hand side
    Real rhs = (((F(col, (2 * row) - ((col + 1) & 1))
               - F(col - 1, (2 * row) - ((col + 1) & 1))) / dx)
              + ((G(col, (2 * row) - ((col + 1) & 1))
               - G(col, (2 * row) - ((col + 1) & 1) - 1)) / dy)) / dt;

    pres_black(col, row) = p_ij * (ONE - omega) + omega *
        (((p_ip1j + p_im1j) / (dx * dx))
       + ((p_ijp1 + p_ijm1) / (dy * dy)) - rhs)
        / ((TWO / (dx * dx)) + (TWO / (dy * dy)));
  }
}

In my testing, I noticed that the “p_ip1j” variable wasn’t accessing the correct location in the “pres_red” array (which was calculated in a previous function that looks similar to this one).

When I changed the pres_red accesses to the explicit 1D form (as in the #define statement), it seemed to correct the issue. Strangely, the accesses to the “F” and “G” arrays, which use a similar #define, are fine.

My only guess is that the #define is misbehaving because it depends on a local variable (NUM_2), whereas the #define statements for F and G do not.

It turns out I spoke too soon… After replacing all my pres_red and pres_black references with explicit array accesses, the code isn’t working again (the CPU version is still fine, however).

If it tells you anything, this is what -Minfo=accel tells me:
450, Generating present(pres_black[0:24])
     Generating present(pres_red[0:24])
     Generating present(G[0:36])
     Generating present(F[0:36])
     Generating compute capability 2.0 binary
452, Loop is parallelizable
454, Loop is parallelizable
     Accelerator kernel generated
     452, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
     454, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
          Cached references to size [(x)] block of ‘pres_red’
     CC 2.0 : 40 registers; 0 shared, 88 constant, 0 local memory bytes

Hi Kyle,

Can you please send a reproducing example to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me? While I highly doubt the problem is with the macro, I’m not sure what the issue is and it will take some digging to figure it out.

Thanks,
Mat

I sent an email with code—thanks for your help! I’m going to keep working on it as well.

I removed all uses of the macro, but the problem is persisting—so it isn’t that. Through a little testing, it looks like the “pres_red” array calls in the “black_kernel” function and “pres_black” calls in the “red_kernel” function somehow aren’t returning the correct values.

Hi Kyle,

Thanks for the example code. I was able to track down two issues.

The first is a problem with the auto-caching in three of the routines. If I add the flag “-ta=nvidia,nocache” or hoist the declaration of NUM_2 out of the loops, the answers come out correct. I have filed this problem as TPR#19040 and sent it to our compiler engineers for further investigation.

The second issue is the if statement “if ((col & 1) == 1) {” in calculate_v. While I’m not sure why, this seems to generate incorrect code. The workaround is to change it to the equivalent statement “if ((col % 2) == 1) {”. I have filed this one as TPR#19041.

Note that there is one small remaining difference: some values set in “set_BCs” come back as -0.0 on the GPU versus +0.0 on the CPU. These are numerically equal values, but a strict “diff” of the results flags them as different.

I have sent you an updated file with my modifications.

Best Regards,
Mat

TPR 19041 - OpenACC “if ((col & 1) == 1) {” in a compute region gives wrong answers

has been fixed in the current 14.9 release.

Thanks again for your report.

dave