CUDA Kernel code results differ from CPU results.

Hello,

When I run the following CUDA code I get different results for the RAD_VEC than when it is run on the CPU. Can you please explain why this is happening?

PROGRAM testCUDA

USE GPU_KERNELS

REAL, ALLOCATABLE :: RAD_VEC(:,:)
REAL, DEVICE, ALLOCATABLE :: RAD_VEC_DEV(:,:)
INTEGER :: IBK, IWL
INTEGER, PARAMETER :: NWL = 224, NBCKGND = 64
REAL :: CL
REAL, DEVICE :: CL_DEV

ALLOCATE( RAD_VEC(NWL,NBCKGND), STAT=IOS)

DO IWL = 1,NWL
DO IBK = 1,NBCKGND
RAD_VEC(IWL,IBK)=2.0
END DO
END DO

ALLOCATE( RAD_VEC_DEV(NWL,NBCKGND) )

RAD_VEC_DEV = RAD_VEC(1:NWL, 1:NBCKGND)

CL = 0.0
CL_DEV = 0.0

!*** Begin Non-CUDA

! DO IBK = 1,NBCKGND
! CL=0.0
! DO IWL = 1,NWL
! CL=CL+RAD_VEC(IWL,IBK)**2
! END DO
! IF (CL<EPSMIN4) CL=1.0
! CL=SQRT(CL)
! DO IWL = 1,NWL
! RAD_VEC(IWL,IBK)=RAD_VEC(IWL,IBK)/CL
! END DO
! END DO

!*** End Non-CUDA

!*** Begin CUDA Calls

call TEST_KERNEL<<<(NBCKGND-1)/16+1,16>>>(RAD_VEC_DEV, NWL, NBCKGND, CL_DEV)

RAD_VEC(1:NWL,1:NBCKGND) = RAD_VEC_DEV

CL = CL_DEV

!*** End CUDA Calls

print *, "CL = ", CL

DO IBK = 1,NBCKGND
DO IWL = 1,NWL
IF ( IBK .EQ. 1 ) THEN
print *, RAD_VEC(IWL,IBK)
END IF
END DO
END DO

END PROGRAM testCUDA


module GPU_KERNELS
use cudafor

contains

attributes(global) subroutine TEST_KERNEL(RAD_VEC, NWL, NBCKGND, CL)

real, device :: RAD_VEC(NWL, NBCKGND), CL
integer, value :: NWL, NBCKGND
integer :: tx, ibk, iwl, i
real, parameter :: EPSMIN4 = 1.1754944E-38

tx = threadidx%x

i = ( blockidx%x-1 ) * blockdim%x + tx

if ( i .le. NBCKGND ) then

do iwl = 1,NWL
CL = CL + RAD_VEC(iwl, i)
end do
if ( CL < EPSMIN4 ) CL = 1.0
CL=SQRT(CL)
do iwl = 1, NWL
RAD_VEC(iwl,i) = RAD_VEC(iwl,i) + RAD_VEC(iwl, i)/CL
end do
end if
call syncthreads()

end subroutine

end module GPU_KERNELS

Hi Pebbles,

When I run the following CUDA code I get different results for the RAD_VEC than when it is run on the CPU. Can you please explain why this is happening?

Can you please give me more details? If you mean that the CUDA Fortran version is different than the CPU version commented out, then the main problem is that the two algorithms are different so will produce different output.


Other than that I see two other bugs in your code. First, your launch configuration is incorrect. I changed it to:

call TEST_KERNEL<<<(NBCKGND+15)/16,16>>>(RAD_VEC_DEV, NWL, NBCKGND)

Secondly, you pass in a global scalar, CL_DEV, that all threads modify and use.

I’ve modify your code below to better match the commented out CPU version and fix the two bugs. I’m not sure it’s exactly what you want, but I do show it matching the commented out CPU version.

Hope this helps,
Mat


% cat test.cuf
module GPU_KERNELS
use cudafor

contains

attributes(global) subroutine TEST_KERNEL(RAD_VEC, NWL, NBCKGND)

real, device :: RAD_VEC(NWL, NBCKGND)
integer, value :: NWL, NBCKGND
real :: CL
integer :: tx, ibk, iwl, i
real, parameter :: EPSMIN4 = 1.1754944E-38

tx = threadidx%x

i = ( blockidx%x-1 ) * blockdim%x + tx

! DO IBK = 1,NBCKGND
! CL=0.0
! DO IWL = 1,NWL
! CL=CL+RAD_VEC(IWL,IBK)**2
! END DO
! IF (CL<EPSMIN4) CL=1.0
! CL=SQRT(CL)
! DO IWL = 1,NWL
! RAD_VEC(IWL,IBK)=RAD_VEC(IWL,IBK)/CL
! END DO
! END DO


if ( i .le. NBCKGND ) then
CL=0.0
do iwl = 1,NWL
CL = CL + RAD_VEC(iwl, i)**2
end do
if ( CL < EPSMIN4 ) CL = 1.0
CL=SQRT(CL)
do iwl = 1, NWL
RAD_VEC(iwl,i) = RAD_VEC(iwl, i)/CL
end do
end if
call syncthreads()

end subroutine

end module GPU_KERNELS

PROGRAM testCUDA

USE GPU_KERNELS

REAL, ALLOCATABLE :: RAD_VEC(:,:)
REAL, DEVICE, ALLOCATABLE :: RAD_VEC_DEV(:,:)
INTEGER :: IBK, IWL
INTEGER, PARAMETER :: NWL = 224, NBCKGND = 64
REAL :: CL
REAL, DEVICE :: CL_DEV

ALLOCATE( RAD_VEC(NWL,NBCKGND), STAT=IOS)

DO IWL = 1,NWL
DO IBK = 1,NBCKGND
RAD_VEC(IWL,IBK)=2.0
END DO
END DO

ALLOCATE( RAD_VEC_DEV(NWL,NBCKGND) )

RAD_VEC_DEV = RAD_VEC(1:NWL, 1:NBCKGND)

CL = 0.0

!*** Begin Non-CUDA

! DO IBK = 1,NBCKGND
! CL=0.0
! DO IWL = 1,NWL
! CL=CL+RAD_VEC(IWL,IBK)**2
! END DO
! IF (CL<EPSMIN4) CL=1.0
! CL=SQRT(CL)
! DO IWL = 1,NWL
! RAD_VEC(IWL,IBK)=RAD_VEC(IWL,IBK)/CL
! END DO
! END DO

!*** End Non-CUDA

!*** Begin CUDA Calls

call TEST_KERNEL<<<(NBCKGND+15)/16,16>>>(RAD_VEC_DEV, NWL, NBCKGND)

RAD_VEC(1:NWL,1:NBCKGND) = RAD_VEC_DEV


!*** End CUDA Calls

print *, "CL = ", CL

DO IBK = 1,NBCKGND
DO IWL = 1,NWL
IF ( IBK .EQ. 1 ) THEN
print *, RAD_VEC(IWL,IBK)
END IF
END DO
END DO

END PROGRAM testCUDA

% pgf90 -O2 test.cuf -Mcuda -o gpu1 -V11.0; gpu1
 CL =     0.000000    
   6.6815309E-02
   6.6815309E-02
   6.6815309E-02
   6.6815309E-02
... continues.