I have extracted the following example from a much bigger application illustrating some slightly non-trivial doubly nested loops. Essentially the code transposes a matrix in place. The following code works fine on the host with OMP but fails on the device. Clearly triangular loops are less desirable than rectangular ones, but changing all the loops in the application is not an option.
Could you explain either (1) why the example code is buggy, or (2) if it is a compiler problem, could you open a ticket?
Compiler: 12.9
Compile line: pgf95 -c -O3 -fastsse -I. -acc -ta=nvidia -Minfo=accel -Mpreprocess -mp=nonuma transpose_bug.f90
PROGRAM transpose_bug
IMPLICIT NONE
INTEGER, PARAMETER :: n = 3
INTEGER :: i, j
DOUBLE PRECISION, POINTER :: array_orig_h(:,:), array1_h(:,:)
DOUBLE PRECISION :: tmp, tmp2(3)
ALLOCATE( array_orig_h( n, n ) )
ALLOCATE( array1_h( n, n ) )
DO i=1, n
DO j=1, n
array_orig_h(i,j) = DBLE(i) + DBLE(j)/1000.0
array1_h(i,j) = array_orig_h(i,j)
ENDDO
ENDDO
print *, “Original matrix”, array_orig_h
! Transpose array1_h on host (OMP version)
!$OMP PARALLEL DO PRIVATE(i,j,tmp)
DO i=1, n
DO j=1, i-1
! Exchange element at array1_h(i,j)
tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
ENDDO
ENDDO
!$OMP END PARALLEL DO
print *, “after first host transpose”, array1_h
! Transpose array1_h on host (OMP version)
!$OMP PARALLEL DO PRIVATE(i,j,tmp)
DO i=1, n
DO j=1, i-1
! Exchange element at array1_h(i,j)
tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
ENDDO
ENDDO
!$OMP END PARALLEL DO
print *, “after second host transpose”, array1_h
print *, "HOST/HOST transpose: Difference to original should be zero ", SUM(array1_h-array_orig_h)
!$acc data copy(array1_h)
!$ACC PARALLEL LOOP PRIVATE(i,j,tmp)
DO i=1, n
DO j=1, i-1
! Exchange element at array1_h(i,j)
tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
ENDDO
ENDDO
!$ACC END PARALLEL LOOP
!$acc wait
print *, “after first device transpose”, array1_h
!$ACC PARALLEL LOOP PRIVATE(i,j,tmp)
DO i=1, n
DO j=1, i-1
! Exchange element at array1_h(i,j)
tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
ENDDO
ENDDO
!$ACC END PARALLEL LOOP
!$acc wait
!$acc end data
print *, “result”, array1_h
print *, "DEVICE/DEVICE transpose: Difference to original should be zero but is not ", SUM(array1_h-array_orig_h)
DEALLOCATE( array_orig_h, array1_h )
END PROGRAM transpose_bug
Comments: the host output gives the expected result, namely
HOST/HOST transpose: Difference to original should be zero
0.000000000000000
The second device transpose yields:
result 1.001000000000000 2.001000000000000
3.002000000000000 1.002000000000000 2.002000000000000
3.002000000000000 2.003000000000000 2.003000000000000
3.003000000000000
DEVICE/DEVICE transpose: Difference to original should be zero but is not
1.001000000000000
One interesting result of looking at the dump with NVDEBUG=1 is that tmp seems to be allocated with cu_alloc, and is not in a register (as it should be):
__pgi_cu_module_function( name=0x6906b0=transpose_bug_66_gpu, lineno=66, argname=(nil)=, argsize=40, varname=(nil)=, varsize=0, SWcachesize=0 )
Function handle is 0x2015770
__pgi_cu_alloc(size=24,lineno=66,name=tmp)
__pgi_cu_alloc(24) returns 0x200200200 (address=0x7fff4cc18158)
The fact that tmp is allocated with 24 bytes (3x8) is suspicious: if a GPU private array were used instead of a register, it would be safer to have an n x n array.
I’ve discussed this at length with Jeff Poznanovic and have played around with the directives, e.g., adding an !$acc loop directive around the inner loop. The problem persists.
Hope you can give us some input.
Thanks!
–Will