basic CUDA help

Hi, I’m playing around with CUDA for the first time and I’m trying to replace the main loop in a subroutine with a cuda kernel. I’m not getting correct numbers so I’m wondering if someone could just check for glaring CUDA mistakes. I also know its probably written poorly but for now I’m just trying to get it working. Also, when I compile with -fast I sometimes get:

0: copyout Memcpy (host=0xe3e148, dev=0x200100000, size=4356) FAILED: 4(unspecified launch failure)

Code below. It looks a bit long but a lot of it is calculations you can probably ignore. Thanks for any help!

The host code:

         subroutine grid

c    source quantities are are calculated: n_i
c    right now only ion quantitities are calculated...
c
         use slabcuda
         use cudafor
         include  'slab.h'
c
         integer i,j

c arrays being used in loop declared on device
         real, device :: den_d(0:im,0:jm),w3_d(mm),mu_d(mm)
         real, device :: x3_d(mm),rwx_d(4),y3_d(mm),rwy_d(4)
c gridDim defines geometry (& #) of blocks in the grid
c blockDim defines geometry (& #) of threads in the block
c grid block is 1D since we have threads divided over just mm
         type(dim3) :: bGrid, tBlock
         tBlock = dim3(256,1,1) ! 256 threads per block
         bGrid = dim3(ceiling(real(mm)/tBlock%x),1,1)

c     here we set the rho and den equal to zero.
           do 50 i=0,im
              do 60 j=0,jm
                    den(i,j)=0.
 60           continue
 50        continue

c
            dv=(dx*dy)

c transfer arrays from host to device
           den_d = den
           w3_d = w3
           mu_d = mu
           x3_d = x3
           rwx_d = rwx
           y3_d = y3
           rwy_d = rwy
c launch kernel
           call gridloop<<<bGrid, tBlock>>>(den_d,w3_d,mu_d,
     &                                       x3_d,rwx_d,y3_d,rwy_d,
     &                                       mm,dv,lr,
     &                                       mims,
     &                                       lx,ly,dx,dy)

c transfer den from device back to host
           den = den_d
              do 300 j=0,jm
                    den(0,j)=( den(0,j)+den(im,j) )
                    den(im,j)=den(0,j)
  300          continue
c
               do 320 i=0,im
                     den(i,0)=(den(i,0)+den(i,jm))
                     den(i,jm)=den(i,0)
  320               continue
c
                  do 410 i=0,im
                     do 420 j=0,jm
                           den(i,j)=q*den(i,j)/n0
  420                 continue
  410              continue
c
        return
        end

The device code (module):

         module slabcuda
         contains
           attributes(global) subroutine gridloop(den,w3,mu,x3,rwx,y3,
     &                                                         rwy,
     &                                             mm,dv,lr,
     &                                             mims,
     &                                             lx,ly,dx,dy)

             implicit none
             real :: den(:,:),w3(:),mu(:),x3(:),rwx(:),y3(:),rwy(:)
             integer :: m,l,i,j,istat
             real, value :: wx0,wx1,wy0,wy1,wght,xt,yt,rhog,lx,ly,dx,dy
             real, value :: dv,mims
             integer, value :: mm,lr

             m = blockDim%x * (blockIdx%x - 1) + threadIdx%x
             if (m<=mm) then

                wght=w3(m)/dv/float(lr)
                rhog=sqrt(mu(m))/mims

               do l=1,lr
                 xt=x3(m)+rwx(l)*rhog
                 yt=y3(m)+rwy(l)*rhog
c
                 if(xt.lt.0.)  xt=-xt
                 if(xt.gt.lx)  xt=2.*lx-xt
                 if(xt.eq.lx)  xt=0.99999*lx
                 if(yt.ge.ly) yt=yt-ly
                 if(yt.le.0.)  yt=yt+ly
                 if(yt.eq.ly)  yt=0.99999*ly


                  i=int(xt/dx)
                  j=int(yt/dy)
c
                  wx0=float(i+1)-xt/dx
                  wx1=1.-wx0
                  wy0=float(j+1)-yt/dy
                  wy1=1.-wy0
c
                  istat = atomicadd(den(i,j), wght*wx0*wy0)
                  istat = atomicadd(den(i+1,j), wght*wx1*wy0)
                  istat = atomicadd(den(i,j+1), wght*wx0*wy1)
                  istat = atomicadd(den(i+1,j+1), wght*wx1*wy1)

               enddo
             endif
           end subroutine gridloop
         end module slabcuda

Hi Brush,

The error you’re seeing is a generic message meaning that your kernel fail for some reason. Though, I don’t see anything obvious as to why. Two things to try:

  1. Add error checking after your kernel call:
           call gridloop<<<bGrid, tBlock>>>(den_d,w3_d,mu_d,
     &                                       x3_d,rwx_d,y3_d,rwy_d,
     &                                       mm,dv,lr,
     &                                       mims,
     &                                       lx,ly,dx,dy) 
           err = cudaGetLastError()
           if (err .ne. 0) then
                  write(*,*) 'Error:', cudaGetErrorString(err)
           endif
  1. Compile with “-g -Mcuda=emu” and then use PGDBG to debug the code in emulation mode. Note everything works the same way in emulation, but it might find the error.

If neither helps, please post or send to PGI Customer Service (trs@pgroup.com) a reproducible example that I can use to investigate.

  • Mat

Another question: in the host code, after the kernel call, the host keeps running so it seems the following den calculations would be inaccurate since the kernel is still updating the den array.

So it seems like adding whatever = cudaThreadSynchronize() after the kernel call would fix this / change the output, but this doesn’t happen, why?

Hi brush,

You are correct in that kernels are launched asynchronously to the host code. However, the host will block on the “den=den_d” statement and wait for the copy to complete before proceeding. Hence, adding cudaThreadSynchronize would make no difference.

  • Mat

Thanks Mat.

I made a stand-alone program, pasted below, to test atomicadd. Basically, each of 25 threads simultaneously adds 1 to each element of a 5x5 array, initialized to 0. But my program isn’t working as expected.

I pasted the output below: The second list in the output is the final array, which should be filled with all 25’s, but for some reason isn’t. The first list is the device array, which is printed from a single random thread. I have a dummy loop to take up some time so the device array is filled to what should be all 25s before the list is printed. I thought both should be identical, but they’re not. They’re also both not filled with all 25s, like I thought they should be. Any idea what is going on?

Device code (module):

      module test_mod
      contains
        attributes(global) subroutine test_sub(den,a,mm)
          real :: den(:,:)
          integer :: i,j,m,istat
          integer,value :: mm
          real,value :: a
          m = blockDim%x * (blockIdx%x - 1) + threadIdx%x
          if (m<=mm) then
            do i=0,4
               do j=0,4
                 istat = atomicadd(den(i,j),a)
                   if(m==1) then
                     do istat=1,999999 !allow other threads to finish
                     enddo
                     print *,i,j,den(i,j)
                   endif
               enddo
            enddo
          endif
      end subroutine test_sub
      end module test_mod

Host code (program):

      PROGRAM test_prog

      use test_mod
      use cudafor

      integer mm,i,j
      real a
      real :: den(0:4,0:4)
      real, device :: den_d(0:4,0:4)
      type(dim3) :: bGrid, tBlock
      mm=25
      tBlock = dim3(256,1,1)
      bGrid = dim3(ceiling(real(mm)/tBlock%x),1,1)

      a=1.

      do i=0,4
         do j=0,4
            den(i,j)=0.
         enddo
      enddo

      den_d = den
      call test_sub<<<bGrid, tBlock>>>(den_d,a,mm)
      den = den_d

      do i=0,4
         do j=0,4
            print *, i,j,den(i,j)
         enddo
      enddo
      write(*,*) 'Max error: ', maxval(abs(den-25.0))

      END

Output:
i j den(i,j) <— printed by a random thread on the device

 0 0 332.699280
 0 1 331.999420
 0 2 25.000000
 0 3 25.000000
 0 4 25.000000
 1 0 328.685303
 1 1 25.000000
 1 2 25.000000
 1 3 25.000000
 1 4 25.000000
 2 0 332.804474
 2 1 25.000000
 2 2 25.000000
 2 3 25.000000
 2 4 25.000000
 3 0 323.017395
 3 1 25.000000
 3 2 25.000000
 3 3 25.000000
 3 4 25.000000
 4 0 327.530579
 4 1 25.000000
 4 2 25.000000
 4 3 25.000000
 4 4 25.000000

i j den(i,j) <— printed by the host, after the device array has been copied back

            0            0    25.00000    
            0            1    25.00000    
            0            2    25.00000    
            0            3    25.00000    
            0            4    0.000000    
            1            0    25.00000    
            1            1    25.00000    
            1            2    25.00000    
            1            3    25.00000    
            1            4    0.000000    
            2            0    25.00000    
            2            1    25.00000    
            2            2    25.00000    
            2            3    25.00000    
            2            4    0.000000    
            3            0    25.00000    
            3            1    25.00000    
            3            2    25.00000    
            3            3    25.00000    
            3            4    0.000000    
            4            0    25.00000    
            4            1    25.00000    
            4            2    25.00000    
            4            3    0.000000    
            4            4    0.000000