Hello (;-) same player again …)
The !$acc … unroll(X) clause produces wrong results with pgi/12.X through pgi/13.4;
it works correctly with pgi/11.10.
I discovered it with the simple STREAM benchmark, where my new Titan card
appears to deliver 337 GB/s, more than the 288 GB/s theoretically possible!
Below is a simple code that compares the same computation done on the host and on the device with unroll(8).
The code
PROGRAM TEST_UNROLL
  IMPLICIT NONE
  INTEGER,PARAMETER :: n = 1024*1024
  REAL, DIMENSION(n) :: ad,bd,cd,ah,bh,ch
  INTEGER,PARAMETER :: NUNROLL = 8
  INTEGER :: i
  ! Init host & device arrays in the same way
  do i=1,n
    ah(i) = 0.0 ; bh(i) = 0.5 * i*i ; ch(i) = 0.25 * i*i*i
  end do
  ad(:) = ah(:) ; bd(:) = bh(:) ; cd(:) = ch(:)
  ! Host part
  do i=1,n
    ah(i) = bh(i) + ch(i)
  end do
  ! Device part with unrolling
  ! acc kernels loop gang, vector unroll(NUNROLL)
  !$acc region do parallel unroll(NUNROLL)
  do i=1,n
    ad(i) = bd(i) + cd(i)
  end do
  print*, "ERR(Device - Host) =", ad(n) - ah(n) ; call flush(6)
END PROGRAM TEST_UNROLL
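The test above only compares the last element, ad(n). A check over the whole array would also show how many elements are affected. Here is a minimal host-only sketch of such a check. It is not part of the original test: the program name CHECK_ALL is hypothetical, and the "stale tail" is simulated by hand purely to give the check something to report.

```fortran
PROGRAM CHECK_ALL
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 1024*1024
  REAL, DIMENSION(:), ALLOCATABLE :: ad, ah
  INTEGER :: i
  ALLOCATE(ad(n), ah(n))
  ! Reference result, computed on the host
  do i = 1, n
    ah(i) = 0.5 * i*i + 0.25 * i*i*i
  end do
  ! Stand-in for the device result (assumption: no GPU is used here)
  ad(:) = ah(:)
  ! Simulate a faulty run by hand: the second half is left at 0
  ad(n/2+1:n) = 0.0
  ! Whole-array comparison instead of a single element
  print *, "mismatching elements =", COUNT(ad /= ah)
  print *, "max |Device - Host|  =", MAXVAL(ABS(ad - ah))
  DEALLOCATE(ad, ah)
END PROGRAM CHECK_ALL
```

In the real test, the two print statements could replace the single ERR(Device - Host) line, with ad coming from the accelerator region.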
Compilation with pgi 11.10
pgf90 --version -Minfo=acc -ta=host,nvidia,keepgpu test_unroll.f90 -o test_unroll_pgi11.10 2>&1 | egrep 'target|unroll'
test_unroll:
27, !$acc do parallel unroll(8), vector(256) ! blockidx%x threadidx%x
pgf90 11.10-0 64-bit target on x86-64 Linux -tp nehalem
Execution with pgi 11.10
test_unroll_pgi11.10
ERR(Device - Host) = 0.000000
Compilation with pgi 12.X through 13.4
pgf90 --version -Minfo=acc -ta=host,nvidia,keepgpu test_unroll.f90 -o test_unroll_pgi13.04 2>&1 | egrep 'target|unroll'
test_unroll:
27, !$acc loop gang unroll(8), vector(128) ! blockidx%x threadidx%x
pgf90 13.4-0 64-bit target on x86-64 Linux -tp nehalem
Execution with pgi 13.4
test_unroll_pgi13.04
ERR(Device - Host) = -2.8823089E+17
See you,
Juan
PS1: the bug is the same with the PGI Accelerator directives and with the OpenACC directives.
PS2: looking at the generated GPU code, the compiler forgot to increment the pointers into the ad, bd and cd arrays.
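To illustrate what PS2 describes, here is a host-only sketch of an 8-way unrolled loop, once with the index advancing inside each chunk and once with the offset dropped. It is an assumption about the failure mode, not a copy of the generated GPU code: the blocked mapping of elements to chunks, the small n and the names a_ok/a_bad are all hypothetical.

```fortran
PROGRAM UNROLL_SKETCH
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 64, NUNROLL = 8
  REAL, DIMENSION(n) :: b, c, a_ref, a_ok, a_bad
  INTEGER :: i, k
  do i = 1, n
    b(i) = 0.5 * i ; c(i) = 0.25 * i
  end do
  ! Reference: plain loop, no unrolling
  a_ref(:) = b(:) + c(:)
  a_ok(:) = 0.0 ; a_bad(:) = 0.0
  ! Correct unrolling: the offset k advances the index inside each chunk
  do i = 1, n, NUNROLL
    do k = 0, NUNROLL-1
      a_ok(i+k) = b(i+k) + c(i+k)
    end do
  end do
  ! Faulty unrolling (assumed failure mode): the offset is dropped, so the
  ! same element is written NUNROLL times and the others keep their initial 0
  do i = 1, n, NUNROLL
    do k = 0, NUNROLL-1
      a_bad(i) = b(i) + c(i)
    end do
  end do
  print *, "correct: wrong elements =", COUNT(a_ok /= a_ref)
  print *, "faulty : wrong elements =", COUNT(a_bad /= a_ref)
  print *, "faulty : ERR at n       =", a_bad(n) - a_ref(n)
END PROGRAM UNROLL_SKETCH
```

In the faulty variant the last element is never written, so the error at n is minus the host value. In the real test the host value at n is about 0.25*n**3, roughly 2.88E+17, which is the same magnitude as the error printed by the 13.4 build.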