CUDA Kernel code results differ from CPU results.

Pebbles1 · January 4, 2011, 10:57pm

Hello,

When I run the following CUDA code I get different results for the RAD_VEC than when it is run on the CPU. Can you please explain why this is happening?

PROGRAM testCUDA

  USE GPU_KERNELS

  REAL, ALLOCATABLE :: RAD_VEC(:,:)
  REAL, DEVICE, ALLOCATABLE :: RAD_VEC_DEV(:,:)
  INTEGER :: IBK, IWL
  INTEGER, PARAMETER :: NWL = 224, NBCKGND = 64
  REAL :: CL
  REAL, DEVICE :: CL_DEV

  ALLOCATE( RAD_VEC(NWL,NBCKGND), STAT=IOS)

  DO IWL = 1,NWL
    DO IBK = 1,NBCKGND
      RAD_VEC(IWL,IBK)=2.0
    END DO
  END DO

  ALLOCATE( RAD_VEC_DEV(NWL,NBCKGND) )

  RAD_VEC_DEV = RAD_VEC(1:NWL, 1:NBCKGND)

  CL = 0.0
  CL_DEV = 0.0

  !*** Begin Non-CUDA

  ! DO IBK = 1,NBCKGND
  !   CL=0.0
  !   DO IWL = 1,NWL
  !     CL=CL+RAD_VEC(IWL,IBK)**2
  !   END DO
  !   IF (CL<EPSMIN4) CL=1.0
  !   CL=SQRT(CL)
  !   DO IWL = 1,NWL
  !     RAD_VEC(IWL,IBK)=RAD_VEC(IWL,IBK)/CL
  !   END DO
  ! END DO

  !*** End Non-CUDA

  !*** Begin CUDA Calls

  call TEST_KERNEL<<<(NBCKGND-1)/16+1,16>>>(RAD_VEC_DEV, NWL, NBCKGND, CL_DEV)

  RAD_VEC(1:NWL,1:NBCKGND) = RAD_VEC_DEV

  CL = CL_DEV

  !*** End CUDA Calls

  print *, "CL = ", CL

  DO IBK = 1,NBCKGND
    DO IWL = 1,NWL
      IF ( IBK .EQ. 1 ) THEN
        print *, RAD_VEC(IWL,IBK)
      END IF
    END DO
  END DO

END PROGRAM testCUDA

module GPU_KERNELS
  use cudafor

contains

  attributes(global) subroutine TEST_KERNEL(RAD_VEC, NWL, NBCKGND, CL)

    real, device :: RAD_VEC(NWL, NBCKGND), CL
    integer, value :: NWL, NBCKGND
    integer :: tx, ibk, iwl, i
    real, parameter :: EPSMIN4 = 1.1754944E-38

    tx = threadidx%x

    i = ( blockidx%x-1 ) * blockdim%x + tx

    if ( i .le. NBCKGND ) then

      do iwl = 1,NWL
        CL = CL + RAD_VEC(iwl, i)
      end do
      if ( CL < EPSMIN4 ) CL = 1.0
      CL=SQRT(CL)
      do iwl = 1, NWL
        RAD_VEC(iwl,i) = RAD_VEC(iwl,i) + RAD_VEC(iwl, i)/CL
      end do
    end if
    call syncthreads()

  end subroutine

end module GPU_KERNELS

MatColgrove · January 5, 2011, 12:44am

Hi Pebbles,

> When I run the following CUDA code I get different results for the RAD_VEC than when it is run on the CPU. Can you please explain why this is happening?

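Can you please give me more details? If you mean that the CUDA Fortran version gives different results than the commented-out CPU version, then the main problem is that the two algorithms are different, so they will produce different output. Your kernel sums RAD_VEC(iwl,i) where the CPU loop sums RAD_VEC(IWL,IBK)**2, and it adds RAD_VEC(iwl,i)/CL to the existing value where the CPU loop replaces it.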
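To make the difference concrete, here is a minimal host-only sketch (plain Fortran, no GPU needed) that applies both formulas to a single column of 224 values of 2.0. The program name compare_formulas and the variables col, cl_cpu and cl_gpu are made up for this illustration, and it assumes the kernel's CL starts at zero for the column, which the original kernel does not guarantee.

```fortran
program compare_formulas
  implicit none
  integer, parameter :: nwl = 224
  real :: col(nwl), cl_cpu, cl_gpu

  col = 2.0

  ! Commented-out CPU version: divide by the square root of the sum of squares
  cl_cpu = sqrt(sum(col**2))

  ! Original kernel: square root of the plain sum (no **2)
  cl_gpu = sqrt(sum(col))

  print *, 'CPU formula   : ', col(1) / cl_cpu
  ! Original kernel also adds the scaled value to the old one
  print *, 'kernel formula: ', col(1) + col(1) / cl_gpu
end program compare_formulas
```

With these inputs the first line should come out near 0.0668 and the second near 2.09, so the two versions cannot agree even before any threading issue comes into play.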

Other than that I see two other bugs in your code. First, your launch configuration is incorrect. I changed it to:

call TEST_KERNEL<<<(NBCKGND+15)/16,16>>>(RAD_VEC_DEV, NWL, NBCKGND)
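For reference, the two grid-size expressions, (NBCKGND-1)/16+1 and (NBCKGND+15)/16, can be compared with a short host-side check. This is only a sketch; the program name grid_size_check and the tested range 1..1000 are arbitrary choices for the illustration.

```fortran
program grid_size_check
  implicit none
  integer :: n
  logical :: same

  same = .true.
  do n = 1, 1000
    ! Fortran integer division truncates, so both forms round n/16 upward
    if ((n - 1)/16 + 1 /= (n + 15)/16) same = .false.
  end do

  print *, 'blocks for n = 64          : ', (64 + 15)/16
  print *, 'forms agree for n = 1..1000: ', same
end program grid_size_check
```

In this check both forms give 4 blocks of 16 threads for NBCKGND = 64 and agree over the whole tested range, which suggests that the change that matters in the launch line is the dropped CL_DEV argument.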

Secondly, you pass in a global scalar, CL_DEV, that all threads read and modify at the same time. That is a race condition: each thread needs its own local CL for its column.
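A serial stand-in for the shared-scalar problem, as a rough sketch: one accumulator is carried across columns without being reset, next to a per-column local one. The names shared_accumulator, a, b, cl_shared and cl_local are hypothetical, and a real GPU run is worse than this, because the threads interleave their updates in an unpredictable order.

```fortran
program shared_accumulator
  implicit none
  integer, parameter :: nwl = 224, nb = 4
  real :: a(nwl, nb), b(nwl, nb), cl_shared, cl_local
  integer :: j

  a = 2.0
  b = 2.0
  cl_shared = 0.0

  do j = 1, nb
    ! Shared scalar: still holds the value left behind by the previous column
    cl_shared = cl_shared + sum(a(:, j)**2)
    cl_shared = sqrt(cl_shared)
    a(:, j) = a(:, j) / cl_shared

    ! Local scalar: computed fresh for every column
    cl_local = sqrt(sum(b(:, j)**2))
    b(:, j) = b(:, j) / cl_local
  end do

  print *, 'shared CL, first row: ', a(1, :)
  print *, 'local CL,  first row: ', b(1, :)
end program shared_accumulator
```

With the local scalar every column gets the same normalized value; with the carried-over scalar only the first column does.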

I’ve modified your code below to better match the commented-out CPU version and to fix the two bugs. I’m not sure it’s exactly what you want, but the output below does match the commented-out CPU version.

Hope this helps,
Mat

% cat test.cuf
module GPU_KERNELS
  use cudafor

contains

  attributes(global) subroutine TEST_KERNEL(RAD_VEC, NWL, NBCKGND)

    real, device :: RAD_VEC(NWL, NBCKGND)
    integer, value :: NWL, NBCKGND
    real :: CL   ! local to each thread, so threads no longer share one accumulator
    integer :: tx, ibk, iwl, i
    real, parameter :: EPSMIN4 = 1.1754944E-38

    tx = threadidx%x

    ! Global thread index: one thread per column of RAD_VEC
    i = ( blockidx%x-1 ) * blockdim%x + tx

    ! CPU version, for reference:
    ! DO IBK = 1,NBCKGND
    !   CL=0.0
    !   DO IWL = 1,NWL
    !     CL=CL+RAD_VEC(IWL,IBK)**2
    !   END DO
    !   IF (CL<EPSMIN4) CL=1.0
    !   CL=SQRT(CL)
    !   DO IWL = 1,NWL
    !     RAD_VEC(IWL,IBK)=RAD_VEC(IWL,IBK)/CL
    !   END DO
    ! END DO

    ! Guard against threads past the last column
    if ( i .le. NBCKGND ) then
      CL=0.0
      ! Sum of squares of column i
      do iwl = 1,NWL
        CL = CL + RAD_VEC(iwl, i)**2
      end do
      if ( CL < EPSMIN4 ) CL = 1.0
      CL=SQRT(CL)
      ! Scale column i by its norm
      do iwl = 1, NWL
        RAD_VEC(iwl,i) = RAD_VEC(iwl, i)/CL
      end do
    end if
    call syncthreads()

  end subroutine

end module GPU_KERNELS

PROGRAM testCUDA

  USE GPU_KERNELS

  REAL, ALLOCATABLE :: RAD_VEC(:,:)
  REAL, DEVICE, ALLOCATABLE :: RAD_VEC_DEV(:,:)
  INTEGER :: IBK, IWL
  INTEGER, PARAMETER :: NWL = 224, NBCKGND = 64
  REAL :: CL
  REAL, DEVICE :: CL_DEV

  ALLOCATE( RAD_VEC(NWL,NBCKGND), STAT=IOS)

  DO IWL = 1,NWL
    DO IBK = 1,NBCKGND
      RAD_VEC(IWL,IBK)=2.0
    END DO
  END DO

  ALLOCATE( RAD_VEC_DEV(NWL,NBCKGND) )

  RAD_VEC_DEV = RAD_VEC(1:NWL, 1:NBCKGND)

  CL = 0.0

  !*** Begin Non-CUDA

  ! DO IBK = 1,NBCKGND
  !   CL=0.0
  !   DO IWL = 1,NWL
  !     CL=CL+RAD_VEC(IWL,IBK)**2
  !   END DO
  !   IF (CL<EPSMIN4) CL=1.0
  !   CL=SQRT(CL)
  !   DO IWL = 1,NWL
  !     RAD_VEC(IWL,IBK)=RAD_VEC(IWL,IBK)/CL
  !   END DO
  ! END DO

  !*** End Non-CUDA

  !*** Begin CUDA Calls

  call TEST_KERNEL<<<(NBCKGND+15)/16,16>>>(RAD_VEC_DEV, NWL, NBCKGND)

  RAD_VEC(1:NWL,1:NBCKGND) = RAD_VEC_DEV


  !*** End CUDA Calls

  print *, "CL = ", CL

  DO IBK = 1,NBCKGND
    DO IWL = 1,NWL
      IF ( IBK .EQ. 1 ) THEN
        print *, RAD_VEC(IWL,IBK)
      END IF
    END DO
  END DO

END PROGRAM testCUDA

% pgf90 -O2 test.cuf -Mcuda -o gpu1 -V11.0; gpu1
 CL =     0.000000    
   6.6815309E-02
   6.6815309E-02
   6.6815309E-02
   6.6815309E-02
... continues.
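
As a sanity check on the printed values, a small host-side sketch: a normalized column should have a sum of squares of 1, and each entry should equal 2.0 divided by the norm of a column of 224 values of 2.0. The names check_output, expected and printed are hypothetical, and the constant 6.6815309E-02 is simply copied from the output above.

```fortran
program check_output
  implicit none
  integer, parameter :: nwl = 224
  real :: expected, printed

  ! Each entry was 2.0 before normalization, so the column norm is sqrt(nwl * 2.0**2)
  expected = 2.0 / sqrt(real(nwl) * 2.0**2)

  ! Value copied from the GPU output above
  printed = 6.6815309E-02

  print *, 'expected entry        : ', expected
  print *, 'abs difference        : ', abs(expected - printed)
  print *, 'column sum of squares : ', real(nwl) * printed**2
end program check_output
```

The difference should be at the level of single-precision rounding and the sum of squares should be close to 1.0, consistent with the GPU result matching the CPU algorithm.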
