CUDA fortran VecAdd code failed to give correct results

king1 · October 20, 2009, 5:57am

I am getting incorrect results for following code.

========================================================================
MODULE vecAdd

CONTAINS
ATTRIBUTES(GLOBAL) SUBROUTINE deviceVecAdd(A, B, C, N)
REAL, DEVICE :: A(N), B(N), C(N)
INTEGER, VALUE :: N
INTEGER :: i
i = threadIdx%x
C(i) = A(i) + B(i)
END SUBROUTINE

END MODULE

PROGRAM main
USE cudafor
USE vecAdd
INTEGER, PARAMETER :: N = 1024
INTEGER :: i, nerrors, istat, idevice

! Vector data
REAL, DIMENSION(N) :: A, B, C, devResult
REAL, ALLOCATABLE, DEVICE, DIMENSION(:) :: dA, dB, dC
TYPE(cudaEvent) :: start, stop

TYPE(dim3) :: blocks
TYPE(dim3) :: threads

idevice = 0
istat = cudaSetDevice(idevice)

CALL random_number(A)
CALL random_number(B)

ALLOCATE(dA(N))
ALLOCATE(dB(N))
ALLOCATE(dC(N))

dA = A
dB = B
dC = 0.0

blocks = dim3(N/16, 1, 1)
threads = dim3(16, 1, 1)

! Host implementation
DO i = 1, N
C(i) = A(i) + B(i)
END DO

CALL deviceVecAdd<<<blocks, threads>>>(dA, dB, dC, N)
istat = cudaThreadSynchronize()
devResult = dC

nerrors = 0
DO i = 1, N
IF(abs(C(i) - devResult(i)) .gt. 1.0e-4) THEN
nerrors = nerrors + 1
END IF
END DO

IF(nerrors .eq. 0) THEN
PRINT *,“Test passed!”
ELSE
PRINT *," Test failed! ", "No of elements failed = ", nerrors
ENDIF

END PROGRAM

========================================================================

i don’t understand why it is giving incorrect results.

Let me know if i am missing any thing.

i am using emulation mode presently (pgfortran -o vecAdd -Mcuda=emu vecAdd.cuf)

MY system information

PGI - 9.0-4 Workstation Linux 64 bit
OS : OpenSUSE 10.3 64 bit

king1 · October 20, 2009, 6:12am

sorry

copy past error in deviceVecAdd kernel subroutine

It is working fine after following change

i = threadidx%x replaced by i = (blockidx%x - 1) * blockdim%x + threadIdx%x

king1 · October 20, 2009, 9:47am

could anybody please clarify why blockIdx%x should be subtracted with 1 to calculate unique thread id in a given grid and block?

See above code for more details.

TheMatt · October 20, 2009, 2:30pm

Well, it doesn’t have to be subtracted for a unique index, it will be unique, but if you don’t subtract 1, your unique index will not start at 1! The confusion–for me, at least–is due to the fact that Fortran indexes arrays from 1 and not 0. Thus, all the CUDA C tutorials are just a bit off with indexing on the GPU.

It makes sense if you layout a 1-D array. The best way I know to figure this out is to do a bit of exploratory print-out-ing. Let’s say you have a 16-element array:

|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|

that is split up into 4 blocks (N=16, blockdim%x=4):

|a|b|c|d|   |e|f|g|h|    |i|j|k|l|    |m|n|o|p|

For each:

|a|b|c|d| -> blockidx%x = 1, blockdim%x=4, threadidx%x=1,2,3,4
|e|f|g|h| -> blockidx%x = 2, blockdim%x=4, threadidx%x=1,2,3,4
|i|j|k|l| -> blockidx%x = 3, blockdim%x=4, threadidx%x=1,2,3,4
|m|n|o|p| -> blockidx%x = 4, blockdim%x=4, threadidx%x=1,2,3,4

(Note, this is different than CUDA C where arrays start at 0, not 1 so threadIdx.x is 0…3 and blockIdx.x=0…3.)

Now, what we want is an index, say idx, that runs 1 to 16 where a->idx=1 and p->idx=16 to match Fortran indexing. In CUDA C, this indexing formula is usually:

Cindex = blockDim.x * blockIdx.x + threadIdx.x

which works because in C, blockDim.x starts at 0 and threadIdx.x starts at 0. So, you get that a->Cindex=04+0=0 and p->Cindex=34+3=15. But in CUDA Fortran, that results in a->idx=14+1=5 and p->idx=44+1=17. idx is unique to each element, but it’s shifted right from what we want.

The way around this is to make sure you subtract 1 from blockidx%x, thus:

(Findex=)idx=blockdim%x * (blockidx%x - 1) + threadidx%x

This, as you’ve found out and simple arithmetic shows, results in an index which has the first element at 1 and the last at 16, in this case.

I’m learning all this along with you, and made a quick program to help me rewire my brain. I just prints out an array that is assigned in various ways on the GPU (using blockidx%x, threadidx%x, etc.) It’s 1D, but I have a 2D version as well that I’m sure I’ll be referring to…a lot. I can post either if you’d like. (I’ve also made up a VIM syntax file for CUDA Fortran, so sure am I that I will be seeing these new variables, etc. in my life.)

king1 · October 21, 2009, 5:12am

TheMatt:

gen:

could anybody please clarify why blockIdx%x should be subtracted with 1 to calculate unique thread id in a given grid and block?

See above code for more details.

Well, it doesn’t have to be subtracted for a unique index, it will be unique, but if you don’t subtract 1, your unique index will not start at 1! The confusion–for me, at least–is due to the fact that Fortran indexes arrays from 1 and not 0. Thus, all the CUDA C tutorials are just a bit off with indexing on the GPU.

It makes sense if you layout a 1-D array. The best way I know to figure this out is to do a bit of exploratory print-out-ing. Let’s say you have a 16-element array:
|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|
that is split up into 4 blocks (N=16, blockdim%x=4):
|a|b|c|d|   |e|f|g|h|    |i|j|k|l|    |m|n|o|p|
For each:
|a|b|c|d| -> blockidx%x = 1, blockdim%x=4, threadidx%x=1,2,3,4
|e|f|g|h| -> blockidx%x = 2, blockdim%x=4, threadidx%x=1,2,3,4
|i|j|k|l| -> blockidx%x = 3, blockdim%x=4, threadidx%x=1,2,3,4
|m|n|o|p| -> blockidx%x = 4, blockdim%x=4, threadidx%x=1,2,3,4
(Note, this is different than CUDA C where arrays start at 0, not 1 so threadIdx.x is 0…3 and blockIdx.x=0…3.)

Now, what we want is an index, say idx, that runs 1 to 16 where a->idx=1 and p->idx=16 to match Fortran indexing. In CUDA C, this indexing formula is usually:
Cindex = blockDim.x * blockIdx.x + threadIdx.x
which works because in C, blockDim.x starts at 0 and threadIdx.x starts at 0. So, you get that a->Cindex=04+0=0 and p->Cindex=34+3=15. But in CUDA Fortran, that results in a->idx=14+1=5 and p->idx=44+1=17. idx is unique to each element, but it’s shifted right from what we want.

The way around this is to make sure you subtract 1 from blockidx%x, thus:
(Findex=)idx=blockdim%x * (blockidx%x - 1) + threadidx%x
This, as you’ve found out and simple arithmetic shows, results in an index which has the first element at 1 and the last at 16, in this case.

I’m learning all this along with you, and made a quick program to help me rewire my brain. I just prints out an array that is assigned in various ways on the GPU (using blockidx%x, threadidx%x, etc.) It’s 1D, but I have a 2D version as well that I’m sure I’ll be referring to…a lot. I can post either if you’d like. (I’ve also made up a VIM syntax file for CUDA Fortran, so sure am I that I will be seeing these new variables, etc. in my life.)

TheMatt,

Thanks a lot. it is really great explanation. If you have already 2D version, post it.

TheMatt · October 23, 2009, 12:19pm

I’m posting my code below. In truth, it’s not too useful, but it’s a good simple program to wrap one’s head around messing with grids, blocks, etc. (Note: I’m sure the PGI folks will cringe at this code, probably CUDA Fortran oopses all over…)

module assign_mod
   use cudafor
contains
   attributes(global) subroutine assign_kernel(A, N)
      implicit none

      integer, value :: N
      integer, device, dimension(N,N) :: A
      integer, device :: idx, idy, id_vert, id_horiz

      integer, device :: bx, by, bdimx, bdimy
      integer, device :: tx, ty

      bx = blockidx%x
      bdimx = blockdim%x
      tx = threadidx%x

      by = blockidx%y
      bdimy = blockdim%y
      ty = threadidx%y

      ! You need blockidx%x-1 here otherwise you won't
      ! get the first blockdim%x of elements assigned!
      !
      ! That is, the first blockidx%x is 1, so, for the
      ! first thread of the first block:
      ! idx = 1*4+1 = 5 if you don't subtract one, so
      ! only elements 5 thru N are assigned!
      idx = (bx-1) * bdimx + tx
      idy = (by-1) * bdimy + ty
      
      ! These two number the array horizontally and
      ! vertically (print out to see)
      id_vert = (idy-1) * N + idx
      id_horiz = (idx-1) * N + idy

      if (idx <= N .and. idy <= N) then
         !A(idx,idy) = tx
         !A(idx,idy) = bx
         !A(idx,idy) = bdimx
         !A(idx,idy) = bdimx * bx
         !A(idx,idy) = bdimx * (bx-1)
         !A(idx,idy) = idx
         A(idx,idy) = id_horiz
         !A(idx,idy) = id_vert
      end if

   end subroutine assign_kernel
end module assign_mod

program main
   use cudafor
   use assign_mod

   implicit none

   integer, parameter :: n = 16
   integer, allocatable, dimension(:,:) :: a_host, b_host
   integer, device, allocatable, dimension(:,:) :: a_device

   type(dim3) :: dimGrid, dimBlock

   integer, parameter :: blocksize = 4

   integer :: i, j

   dimBlock = dim3(blocksize,blocksize,1)
   dimGrid = dim3(n/blocksize,n/blocksize,1)

   allocate(a_host(n,n))
   allocate(b_host(n,n))

   ! This is like a cudaMalloc
   allocate(a_device(n,n))

   ! Just put some ints in the host array that
   ! shouldn't appear in the assign_kernel
   forall (i=1:n,j=1:n)
      a_host(i,j) = 999
   end forall

   ! Implicit cudaMemcpyHostToDevice
   a_device = a_host

   ! Call the kernel
   call assign_kernel<<<dimGrid, dimBlock>>> (a_device, n)

   ! Implicit cudaMemcpyDeviceToHost
   b_host = a_device

   do i = 1, n
      do j = 1, n
         if (mod(j,n) == 0) then
            write (*,"(I3)") b_host(i,j)
         else
            write (*,"(I3,2X)",advance="no") b_host(i,j)
         end if
      end do
   end do

end program main

If you wish to see the effect that not using (blockidx%x-1) will have, try setting:

      idx = (bx) * bdimx + tx
      idy = (by) * bdimy + ty

You’ll see the appearance of “999” in your printout meaning that those elements were not assigned (i.e., they still carry the initial value assigned by the main loop’s FORALL).

Topic		Replies	Views
CudaFotran compiling problem When i am comipiling the cuda fortran code, type mismatch error is com CUDA Programming and Performance	13	3622	December 1, 2009
CUDA Fortran matrix-multiply 10x slower than CUDA C version Legacy PGI Compilers	5	6929	July 14, 2010
An Easy Introduction to CUDA Fortran Technical Blog	7	571	June 21, 2024
Iteration help in CUDA CUDA Programming and Performance	11	6892	April 19, 2012
Annoying problems with memory and/or syntax CUDA Programming and Performance	19	4769	April 8, 2008
Operators both on host and device functions Legacy PGI Compilers	21	10648	October 12, 2010
problem with array offset CUDA Programming and Performance	28	3228	February 3, 2011
An Easy Introduction to CUDA C and C++ Technical Blog	48	1255	July 19, 2018
how does this newbie fix his code? Legacy PGI Compilers	8	3415	March 27, 2012
Urgent help with threads please! CUDA Programming and Performance	21	10787	March 6, 2008

CUDA fortran VecAdd code failed to give correct results

i am using emulation mode presently (pgfortran -o vecAdd -Mcuda=emu vecAdd.cuf) MY system information

Related topics

i am using emulation mode presently (pgfortran -o vecAdd -Mcuda=emu vecAdd.cuf)

MY system information