CUDA fortran VecAdd code failed to give correct results

I am getting incorrect results for following code.

========================================================================
MODULE vecAdd

CONTAINS
ATTRIBUTES(GLOBAL) SUBROUTINE deviceVecAdd(A, B, C, N)
REAL, DEVICE :: A(N), B(N), C(N)
INTEGER, VALUE :: N
INTEGER :: i
i = threadIdx%x
C(i) = A(i) + B(i)
END SUBROUTINE

END MODULE

PROGRAM main
USE cudafor
USE vecAdd
INTEGER, PARAMETER :: N = 1024
INTEGER :: i, nerrors, istat, idevice

! Vector data
REAL, DIMENSION(N) :: A, B, C, devResult
REAL, ALLOCATABLE, DEVICE, DIMENSION(:) :: dA, dB, dC
TYPE(cudaEvent) :: start, stop

TYPE(dim3) :: blocks
TYPE(dim3) :: threads

idevice = 0
istat = cudaSetDevice(idevice)

CALL random_number(A)
CALL random_number(B)

ALLOCATE(dA(N))
ALLOCATE(dB(N))
ALLOCATE(dC(N))

dA = A
dB = B
dC = 0.0

blocks = dim3(N/16, 1, 1)
threads = dim3(16, 1, 1)

! Host implementation
DO i = 1, N
C(i) = A(i) + B(i)
END DO

CALL deviceVecAdd<<<blocks, threads>>>(dA, dB, dC, N)
istat = cudaThreadSynchronize()
devResult = dC

nerrors = 0
DO i = 1, N
IF(abs(C(i) - devResult(i)) .gt. 1.0e-4) THEN
nerrors = nerrors + 1
END IF
END DO

IF(nerrors .eq. 0) THEN
PRINT *,“Test passed!”
ELSE
PRINT *," Test failed! ", "No of elements failed = ", nerrors
ENDIF

END PROGRAM

========================================================================

i don’t understand why it is giving incorrect results.

Let me know if i am missing any thing.

i am using emulation mode presently (pgfortran -o vecAdd -Mcuda=emu vecAdd.cuf)



MY system information

PGI - 9.0-4 Workstation Linux 64 bit
OS : OpenSUSE 10.3 64 bit

sorry

copy past error in deviceVecAdd kernel subroutine

It is working fine after following change

i = threadidx%x replaced by i = (blockidx%x - 1) * blockdim%x + threadIdx%x

could anybody please clarify why blockIdx%x should be subtracted with 1 to calculate unique thread id in a given grid and block?

See above code for more details.

Well, it doesn’t have to be subtracted for a unique index, it will be unique, but if you don’t subtract 1, your unique index will not start at 1! The confusion–for me, at least–is due to the fact that Fortran indexes arrays from 1 and not 0. Thus, all the CUDA C tutorials are just a bit off with indexing on the GPU.

It makes sense if you layout a 1-D array. The best way I know to figure this out is to do a bit of exploratory print-out-ing. Let’s say you have a 16-element array:

|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|

that is split up into 4 blocks (N=16, blockdim%x=4):

|a|b|c|d|   |e|f|g|h|    |i|j|k|l|    |m|n|o|p|

For each:

|a|b|c|d| -> blockidx%x = 1, blockdim%x=4, threadidx%x=1,2,3,4
|e|f|g|h| -> blockidx%x = 2, blockdim%x=4, threadidx%x=1,2,3,4
|i|j|k|l| -> blockidx%x = 3, blockdim%x=4, threadidx%x=1,2,3,4
|m|n|o|p| -> blockidx%x = 4, blockdim%x=4, threadidx%x=1,2,3,4

(Note, this is different than CUDA C where arrays start at 0, not 1 so threadIdx.x is 0…3 and blockIdx.x=0…3.)

Now, what we want is an index, say idx, that runs 1 to 16 where a->idx=1 and p->idx=16 to match Fortran indexing. In CUDA C, this indexing formula is usually:

Cindex = blockDim.x * blockIdx.x + threadIdx.x

which works because in C, blockDim.x starts at 0 and threadIdx.x starts at 0. So, you get that a->Cindex=04+0=0 and p->Cindex=34+3=15. But in CUDA Fortran, that results in a->idx=14+1=5 and p->idx=44+1=17. idx is unique to each element, but it’s shifted right from what we want.

The way around this is to make sure you subtract 1 from blockidx%x, thus:

(Findex=)idx=blockdim%x * (blockidx%x - 1) + threadidx%x

This, as you’ve found out and simple arithmetic shows, results in an index which has the first element at 1 and the last at 16, in this case.

I’m learning all this along with you, and made a quick program to help me rewire my brain. I just prints out an array that is assigned in various ways on the GPU (using blockidx%x, threadidx%x, etc.) It’s 1D, but I have a 2D version as well that I’m sure I’ll be referring to…a lot. I can post either if you’d like. (I’ve also made up a VIM syntax file for CUDA Fortran, so sure am I that I will be seeing these new variables, etc. in my life.)

TheMatt,

Thanks a lot. it is really great explanation. If you have already 2D version, post it.

I’m posting my code below. In truth, it’s not too useful, but it’s a good simple program to wrap one’s head around messing with grids, blocks, etc. (Note: I’m sure the PGI folks will cringe at this code, probably CUDA Fortran oopses all over…)

module assign_mod
   use cudafor
contains
   attributes(global) subroutine assign_kernel(A, N)
      implicit none

      integer, value :: N
      integer, device, dimension(N,N) :: A
      integer, device :: idx, idy, id_vert, id_horiz

      integer, device :: bx, by, bdimx, bdimy
      integer, device :: tx, ty

      bx = blockidx%x
      bdimx = blockdim%x
      tx = threadidx%x

      by = blockidx%y
      bdimy = blockdim%y
      ty = threadidx%y

      ! You need blockidx%x-1 here otherwise you won't
      ! get the first blockdim%x of elements assigned!
      !
      ! That is, the first blockidx%x is 1, so, for the
      ! first thread of the first block:
      ! idx = 1*4+1 = 5 if you don't subtract one, so
      ! only elements 5 thru N are assigned!
      idx = (bx-1) * bdimx + tx
      idy = (by-1) * bdimy + ty
      
      ! These two number the array horizontally and
      ! vertically (print out to see)
      id_vert = (idy-1) * N + idx
      id_horiz = (idx-1) * N + idy

      if (idx <= N .and. idy <= N) then
         !A(idx,idy) = tx
         !A(idx,idy) = bx
         !A(idx,idy) = bdimx
         !A(idx,idy) = bdimx * bx
         !A(idx,idy) = bdimx * (bx-1)
         !A(idx,idy) = idx
         A(idx,idy) = id_horiz
         !A(idx,idy) = id_vert
      end if

   end subroutine assign_kernel
end module assign_mod

program main
   use cudafor
   use assign_mod

   implicit none

   integer, parameter :: n = 16
   integer, allocatable, dimension(:,:) :: a_host, b_host
   integer, device, allocatable, dimension(:,:) :: a_device

   type(dim3) :: dimGrid, dimBlock

   integer, parameter :: blocksize = 4

   integer :: i, j

   dimBlock = dim3(blocksize,blocksize,1)
   dimGrid = dim3(n/blocksize,n/blocksize,1)

   allocate(a_host(n,n))
   allocate(b_host(n,n))

   ! This is like a cudaMalloc
   allocate(a_device(n,n))

   ! Just put some ints in the host array that
   ! shouldn't appear in the assign_kernel
   forall (i=1:n,j=1:n)
      a_host(i,j) = 999
   end forall

   ! Implicit cudaMemcpyHostToDevice
   a_device = a_host

   ! Call the kernel
   call assign_kernel<<<dimGrid, dimBlock>>> (a_device, n)

   ! Implicit cudaMemcpyDeviceToHost
   b_host = a_device

   do i = 1, n
      do j = 1, n
         if (mod(j,n) == 0) then
            write (*,"(I3)") b_host(i,j)
         else
            write (*,"(I3,2X)",advance="no") b_host(i,j)
         end if
      end do
   end do

end program main

If you wish to see the effect that not using (blockidx%x-1) will have, try setting:

      idx = (bx) * bdimx + tx
      idy = (by) * bdimy + ty

You’ll see the appearance of “999” in your printout meaning that those elements were not assigned (i.e., they still carry the initial value assigned by the main loop’s FORALL).