Possible bug in triangular loop

I have extracted the following example from a much bigger application; it illustrates some slightly non-trivial doubly nested loops. Essentially, the code transposes a matrix in place. The code works fine on the host with OpenMP but fails on the device. Clearly, triangular loops are less desirable than rectangular ones, but changing all the loops in the application is not an option.

Could you explain either (1) why the example code is buggy, or (2) if it is a compiler problem, could you open a ticket?

Compiler: PGI 12.9

Compile line: pgf95 -c -O3 -fastsse -I. -acc -ta=nvidia -Minfo=accel -Mpreprocess -mp=nonuma transpose_bug.f90

PROGRAM transpose_bug

IMPLICIT NONE

INTEGER, PARAMETER :: n = 3
INTEGER :: i, j
DOUBLE PRECISION, POINTER :: array_orig_h(:,:), array1_h(:,:)
DOUBLE PRECISION :: tmp, tmp2(3)

ALLOCATE( array_orig_h( n, n ) )
ALLOCATE( array1_h( n, n ) )

DO i=1, n
   DO j=1, n
      array_orig_h(i,j) = DBLE(i) + DBLE(j)/1000.0
      array1_h(i,j) = array_orig_h(i,j)
   ENDDO
ENDDO

print *, "Original matrix", array_orig_h

! Transpose array1_h on host (OMP version)

!$OMP PARALLEL DO PRIVATE(i,j,tmp)
DO i=1, n
   DO j=1, i-1
      ! Exchange element at array1_h(i,j)
      tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
   ENDDO
ENDDO
!$OMP END PARALLEL DO

print *, "after first host transpose", array1_h

! Transpose array1_h on host (OMP version)

!$OMP PARALLEL DO PRIVATE(i,j,tmp)
DO i=1, n
   DO j=1, i-1
      ! Exchange element at array1_h(i,j)
      tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
   ENDDO
ENDDO
!$OMP END PARALLEL DO

print *, "after second host transpose", array1_h

print *, "HOST/HOST transpose: Difference to original should be zero ", SUM(array1_h-array_orig_h)

!$acc data copy(array1_h)

!$ACC PARALLEL LOOP PRIVATE(i,j,tmp)
DO i=1, n
   DO j=1, i-1
      ! Exchange element at array1_h(i,j)
      tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
   ENDDO
ENDDO
!$ACC END PARALLEL LOOP

!$acc wait

print *, "after first device transpose", array1_h

!$ACC PARALLEL LOOP PRIVATE(i,j,tmp)
DO i=1, n
   DO j=1, i-1
      ! Exchange element at array1_h(i,j)
      tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
   ENDDO
ENDDO
!$ACC END PARALLEL LOOP

!$acc wait

!$acc end data

print *, "result", array1_h

print *, "DEVICE/DEVICE transpose: Difference to original should be zero but is not ", SUM(array1_h-array_orig_h)

DEALLOCATE( array_orig_h, array1_h )

END PROGRAM transpose_bug

Comments: the host output gives the expected result, namely

HOST/HOST transpose: Difference to original should be zero   0.000000000000000

The second device transpose yields:

result   1.001000000000000   2.001000000000000   3.002000000000000
         1.002000000000000   2.002000000000000   3.002000000000000
         2.003000000000000   2.003000000000000   3.003000000000000
DEVICE/DEVICE transpose: Difference to original should be zero but is not   1.001000000000000


One interesting result of looking at the dump with NVDEBUG=1 is that tmp seems to be allocated with __pgi_cu_alloc rather than kept in a register (as it should be):

__pgi_cu_module_function( name=0x6906b0=transpose_bug_66_gpu, lineno=66, argname=(nil)=, argsize=40, varname=(nil)=, varsize=0, SWcachesize=0 )
Function handle is 0x2015770
__pgi_cu_alloc(size=24,lineno=66,name=tmp)
__pgi_cu_alloc(24) returns 0x200200200 (address=0x7fff4cc18158)

The fact that tmp is allocated with 24 bytes (3x8) is suspicious: if a GPU private array were used instead of a register, it would be safer to have an n x n array.

I’ve discussed this at length with Jeff Poznanovic and have played around with the directives, e.g., adding an !$acc loop directive around the inner loop. The problem persists.
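For concreteness, one of the variants we tried looked along these lines (a sketch; note that tmp is still privatized on the outer loop):

!$ACC PARALLEL LOOP PRIVATE(i,j,tmp)
DO i=1, n
   !$ACC LOOP
   DO j=1, i-1
      tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
   ENDDO
ENDDO
!$ACC END PARALLEL LOOP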

Hope you can give us some input.

Thanks!

–Will

Hi Will,

The problem here is that you privatized TMP on the outer loop. Since the outer loop is the “gang” (or block) loop, TMP is private to each gang but shared amongst the vectors (threads) within that gang.

To fix it, remove the private clauses; scalars are private by default, so the clause is unnecessary. Also, by privatizing a scalar, you’ve created an array of these variables in global memory which needs to be fetched. Without the private clause, “tmp” will be made a local variable and will most likely be put in a register.

Here’s the modified code:

% cat transpose1.f90
PROGRAM transpose_bug

IMPLICIT NONE

INTEGER, PARAMETER :: n = 3
INTEGER :: i, j
DOUBLE PRECISION, POINTER :: array_orig_h(:,:), array1_h(:,:)
DOUBLE PRECISION :: tmp, tmp2(3)

ALLOCATE( array_orig_h( n, n ) )
ALLOCATE( array1_h( n, n ) )

DO i=1, n
   DO j=1, n
      array_orig_h(i,j) = DBLE(i) + DBLE(j)/1000.0
      array1_h(i,j) = array_orig_h(i,j)
   ENDDO
ENDDO

print *, "Original matrix", array_orig_h

! Transpose array1_h on host (OMP version)

!$OMP PARALLEL DO PRIVATE(i,j,tmp)
DO i=1, n
   DO j=1, i-1
      ! Exchange element at array1_h(i,j)
      tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
   ENDDO
ENDDO
!$OMP END PARALLEL DO

print *, "after first host transpose", array1_h

! Transpose array1_h on host (OMP version)

!$OMP PARALLEL DO PRIVATE(i,j,tmp)
DO i=1, n
   DO j=1, i-1
      ! Exchange element at array1_h(i,j)
      tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
   ENDDO
ENDDO
!$OMP END PARALLEL DO

print *, "after second host transpose", array1_h

print *, "HOST/HOST transpose: Difference to original should be zero ", SUM(array1_h-array_orig_h)

!$acc data copy(array1_h)

!$ACC PARALLEL LOOP
DO i=1, n
   DO j=1, i-1
      ! Exchange element at array1_h(i,j)
      tmp = array1_h(i,j)
      array1_h(i,j) = array1_h(j,i)
      array1_h(j,i) = tmp
   ENDDO
ENDDO
!$ACC END PARALLEL LOOP

!$acc wait

print *, "after first device transpose", array1_h

!$ACC PARALLEL LOOP
DO i=1, n
   DO j=1, i-1
      ! Exchange element at array1_h(i,j)
      tmp = array1_h(i,j)
      array1_h(i,j) = array1_h(j,i)
      array1_h(j,i) = tmp
   ENDDO
ENDDO
!$ACC END PARALLEL LOOP

!$acc wait

!$acc end data

print *, "result", array1_h

print *, "DEVICE/DEVICE transpose: Difference to original should be zero but is not ", SUM(array1_h-array_orig_h)

DEALLOCATE( array_orig_h, array1_h )

END PROGRAM transpose_bug
% pgf95 -O3 -fastsse -I. -acc -ta=nvidia -Minfo=accel -Mpreprocess -mp=nonuma transpose1.f90 -V12.9; a.out

transpose_bug:
     51, Generating copy(array1_h(:,:))
     53, Accelerator kernel generated
         53, CC 1.3 : 16 registers; 52 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 17 registers; 4 shared, 64 constant, 0 local memory bytes
         54, !$acc loop gang ! blockidx%x
         55, !$acc loop vector(256) ! threadidx%x
     53, Generating present_or_copy(array1_h(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     55, Loop is parallelizable
     68, Accelerator kernel generated
         68, CC 1.3 : 16 registers; 52 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 17 registers; 4 shared, 64 constant, 0 local memory bytes
         69, !$acc loop gang ! blockidx%x
         70, !$acc loop vector(256) ! threadidx%x
     68, Generating present_or_copy(array1_h(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     70, Loop is parallelizable
 Original matrix   1.001000000000000   2.001000000000000   3.001000000000000
                   1.002000000000000   2.002000000000000   3.002000000000000
                   1.003000000000000   2.003000000000000   3.003000000000000
 after first host transpose   1.001000000000000   1.002000000000000   1.003000000000000
                              2.001000000000000   2.002000000000000   2.003000000000000
                              3.001000000000000   3.002000000000000   3.003000000000000
 after second host transpose   1.001000000000000   2.001000000000000   3.001000000000000
                               1.002000000000000   2.002000000000000   3.002000000000000
                               1.003000000000000   2.003000000000000   3.003000000000000
 HOST/HOST transpose: Difference to original should be zero   0.000000000000000
 after first device transpose   1.001000000000000   2.001000000000000   3.001000000000000
                                1.002000000000000   2.002000000000000   3.002000000000000
                                1.003000000000000   2.003000000000000   3.003000000000000
 result   1.001000000000000   2.001000000000000   3.001000000000000
          1.002000000000000   2.002000000000000   3.002000000000000
          1.003000000000000   2.003000000000000   3.003000000000000
 DEVICE/DEVICE transpose: Difference to original should be zero but is not   0.000000000000000

Hi Mat,

Thanks very much for the quick reply. I can confirm your results.

Related questions: why should the introduction of PRIVATE(tmp) yield incorrect results? When are there guarantees that the GPU will give the same results? Is the correctness of results implementation dependent?

We are now testing the application with Cray CCE, but running into problems with their compiler.

Thanks again,

–Will

Hi Will,

why should the introduction of PRIVATE(tmp) yield incorrect results

It’s because you privatized it on the loop used to generate the “gang” schedule, hence a single “tmp” is being used for all threads within a gang.
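If you did want an explicit private clause, the safe place for it would be the inner loop, where the “vector” schedule is generated, so each thread gets its own copy. A minimal sketch (assuming the compiler maps the outer loop to gang and the inner loop to vector, as in the -Minfo output above):

!$ACC PARALLEL LOOP GANG
DO i=1, n
   ! private(tmp) on the vector loop gives one copy per thread,
   ! not one copy per gang
   !$ACC LOOP VECTOR PRIVATE(tmp)
   DO j=1, i-1
      tmp = array1_h(i,j)
      array1_h(i,j) = array1_h(j,i)
      array1_h(j,i) = tmp
   ENDDO
ENDDO
!$ACC END PARALLEL LOOP

Though as noted above, simply omitting the clause and letting the scalar default to thread-private is preferable.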

When are there guarantees that the GPU will give the same results?

Assuming a correct parallel program, the CPU and GPU results should be in agreement. There may be some differences in floating-point values, but typically these are within an expected tolerance.
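For that reason, when validating, it is safer to compare against a small tolerance rather than expect bitwise-identical results. A sketch using the arrays from your example (the threshold is illustrative):

! Compare host and device results against a tolerance instead of
! testing for exact equality
IF (MAXVAL(ABS(array1_h - array_orig_h)) < 1.0D-12) THEN
   print *, "results agree within tolerance"
END IF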

Is the correctness of results implementation dependent?

Do you mean a vendor’s implementation of OpenACC or a user’s specific code?

– Mat

Is the correctness of results implementation dependent?

Do you mean a vendor’s implementation of OpenACC or a user’s specific code?

OpenACC vendor implementation. The original code did not yield incorrect results with CCE 8.1.1.

Thanks again, --Will

Hi Will,

OpenACC vendor implementation. The original code did not yield incorrect results with CCE 8.1.1.

Most likely the difference is in how they scheduled the loops, with “vector” applied to the outer loop.

In general, it is not recommended to privatize scalars. Besides the issue you encountered, where a change in schedule changed the behavior, privatized scalars must be placed in global memory, whereas an unprivatized local scalar normally lives in a register (spilling to memory only when not enough registers are available).
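To illustrate the schedule dependence, here is a sketch of what CCE likely generated: with “vector” on the outer loop, PRIVATE(tmp) becomes one copy per thread, so the original code happens to produce correct results under that schedule:

! Sketch only: pinning the outer loop to "vector" makes the
! privatized tmp per-thread rather than per-gang
!$ACC PARALLEL LOOP VECTOR PRIVATE(tmp)
DO i=1, n
   DO j=1, i-1
      tmp = array1_h(i,j)
      array1_h(i,j) = array1_h(j,i)
      array1_h(j,i) = tmp
   ENDDO
ENDDO
!$ACC END PARALLEL LOOP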

– Mat