need help with simple cuda test

vibrantcascade · August 4, 2011, 8:35pm

The following code is a test I’m writing to see how easy it will be to adapt some serial scientific computing code to CUDA. I’m having trouble getting it going. Could someone tell me whether I’m doing this completely wrong, or whether I can’t use device functions this way? (The math is irrelevant; I’m just trying to get something that eats up time so I can see the improvement from using CUDA.)

I’m also thinking I need to rework:
a = a + testFunc2(a, i)
as I’m not quite sure how that would execute and add up on the machine. In my non-CUDA version of this I simply call testFunc1 and have a do loop from 1 to 10000 inside it.

And on the line of:
b = testFunc1
There is actually a one hundred followed by a comma then another one hundred. For some reason the forum won’t allow me to enter a line like that.

      program test
      use cudafor
      double precision a, b,testFunc1
      real t0, t1, tdiff    

      call cpu_time(t0)
      a = 1.0000000d0
      b = testFunc1<<<100>>>(a)
      call cpu_time(t1)

      tdiff = t1 - t0

      write(6,*) a, tdiff
    
      call exit    
      end

      module cudaFuncs
      contains
        double precision attributes(device) function testFunc1(a)
        double precision a, testFunc2
        integer i
          i = (blockIdx%x-1)*blockDim%x + threadIdx%x
          a = a + testFunc2(a, i)
          call syncthreads()
          testFunc1 = a 
          return
        end

        double precision attributes(device) function testFunc2(a, i)
        double precision a, testFunc3
        integer i, j 
          a = a * i
          a = a / 3.d0
          do j = 1, 10000
            a = testFunc3(a)
          enddo
          testFunc2 = a
          return 
        end

        double precision attributes(device) function testFunc3(a)
        double precision a
        integer i
 
          do i = 1, 10000
            a = a * a
            a = sqrt(a)
            a = a * 3.d0
            a = a / 5.d0
          enddo
          testFunc3 = a
          return 
        end
      end module cudaFuncs

MatColgrove · August 4, 2011, 11:24pm

Hi vibrantcascade,

b = testFunc1
There is actually a one hundred followed by a comma then another one hundred. For some reason the forum won’t allow me to enter a line like that.

Correct. Only global kernel functions are callable from the host. Device functions can only be called from other device routines located in the same module.

Besides this, there are a number of other problems. Foremost, this isn’t a parallel program. You could run it serially on a GPU, but performance would be very poor: a single GPU core is pretty wimpy compared to a CPU core. Speed-up comes from using lots of GPU cores at once. So if you need to calculate millions of a’s, you’d see an improvement, but not for just one.
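To make the point above concrete, here is a minimal plain-Fortran sketch (no GPU needed) of the serial pattern that maps well onto a kernel. The array size n and the function work are hypothetical stand-ins, not part of the original code. Each iteration touches only its own a(i), so the iterations are independent; in a kernel the loop disappears and each thread computes one value of i from its block and thread indices.

```fortran
program serial_reference
  implicit none
  integer, parameter :: n = 1024        ! assumed problem size
  double precision :: a(n)
  integer :: i

  a = 1.0d0

  ! Every iteration reads and writes only a(i), so the iterations are
  ! independent of one another. On the GPU, iteration i becomes thread i.
  do i = 1, n
    a(i) = a(i) + work(a(i), i)
  end do

  ! If a single total is wanted, it is formed after the independent
  ! updates rather than by having every iteration update one scalar.
  write(*,*) a(1), a(n), sum(a)

contains

  ! Hypothetical stand-in for the real per-element computation.
  pure double precision function work(x, i)
    double precision, intent(in) :: x
    integer, intent(in) :: i
    work = x * i / 3.0d0
  end function work

end program serial_reference
```

Having every thread update the same scalar, as in the original a = a + testFunc2(a, i), would make the threads race on a single memory location, which is why one array element per thread is the usual starting point.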

Below is a modified version of your code which fixes many of the issues. Instead of going through each one, I’d like you to first read some of our PGInsider articles to get a better understanding of how CUDA Fortran works. Also, try looking over the CUDA Fortran SDK for code examples. If you still have questions, please feel free to ask.

Understanding the CUDA Data Parallel Threading Model: A Primer
Introduction to PGI CUDA Fortran
Tuning a Monte Carlo Algorithm on GPUs
CUDA Fortran Data Management
CUDA Fortran Device Kernels
CUDA Fortran: The Next Level

All of the PGInsider articles can be found at: Technical Articles and Publications | PGI
Also, the CUDA Fortran Programming Guide is a good thing to have on hand: PGI Documentation Archive for Versions Prior to 17.7

Hope this helps,
Mat

% cat testFuncs.cuf 
      module cudaFuncs
      contains
        attributes(global) subroutine testFunc1(a,nsize)
        double precision, dimension(:) :: a
        integer, value                 :: nsize
        integer i
          i = (blockIdx%x-1)*blockDim%x + threadIdx%x
        if (i .le. nsize) then
          a(i) = a(i) + testFunc2(a(i), i)
        end if
        end

        double precision attributes(device) function testFunc2(a, i)
        double precision a
        integer i, j
          a = a * i
          a = a / 3.d0
          do j = 1, 10000
            a = testFunc3(a)
          enddo
          testFunc2 = a
          return
        end

        double precision attributes(device) function testFunc3(a)
        double precision a
        integer i
          do i = 1, 10000
            a = a * a
            a = sqrt(a)
            a = a * 3.d0
            a = a / 5.d0
          enddo
          testFunc3 = a
          return
        end
      end module cudaFuncs 

      program test
      use cudafor
      use cudaFuncs
      integer, parameter :: NSIZE=1024
      integer, parameter :: BSIZE=256
      double precision b
      double precision, dimension(NSIZE)         :: a
      double precision, dimension(NSIZE), device :: aDev
      type(dim3) :: grid, block
      real t0, t1, tdiff   

      block = dim3(BSIZE,1,1)
      grid = dim3((NSIZE+BSIZE-1)/BSIZE,1,1)
      call cpu_time(t0)
      aDev = 1.0000000d0
      ! <<<grid,block>>>: number of blocks in the grid, threads per block
      call testFunc1<<<grid,block>>>(aDev,NSIZE)
      a=aDev
      call cpu_time(t1)

      tdiff = t1 - t0

      write(6,*) a(1), a(NSIZE), tdiff
   
      call exit   
      end
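A note on the timing in the program above: kernel launches return to the host before the kernel finishes, and it is the a=aDev copy that waits for completion, so the cpu_time interval covers the kernel plus the data transfers. If only the kernel time is of interest, CUDA events are one option. The following is a rough sketch under stated assumptions: it needs a CUDA Fortran compiler and a GPU, and scaleKernel, timingKernels and eventTiming are hypothetical names used only for illustration.

```fortran
module timingKernels
  contains
    ! Hypothetical kernel used only to have something to time.
    attributes(global) subroutine scaleKernel(a, n)
      double precision, dimension(:) :: a
      integer, value :: n
      integer :: i
      i = (blockIdx%x-1)*blockDim%x + threadIdx%x
      if (i .le. n) a(i) = a(i) * 0.5d0
    end subroutine scaleKernel
end module timingKernels

program eventTiming
  use cudafor
  use timingKernels
  implicit none
  integer, parameter :: NSIZE = 1024, BSIZE = 256
  double precision, dimension(NSIZE)         :: a
  double precision, dimension(NSIZE), device :: aDev
  type(cudaEvent) :: startEvent, stopEvent
  real    :: ms
  integer :: istat

  istat = cudaEventCreate(startEvent)
  istat = cudaEventCreate(stopEvent)

  aDev = 1.0d0

  ! Record events on the default stream on either side of the launch.
  istat = cudaEventRecord(startEvent, 0)
  call scaleKernel<<<(NSIZE+BSIZE-1)/BSIZE, BSIZE>>>(aDev, NSIZE)
  istat = cudaEventRecord(stopEvent, 0)

  ! Block the host until the stop event (and so the kernel) has completed.
  istat = cudaEventSynchronize(stopEvent)
  istat = cudaEventElapsedTime(ms, startEvent, stopEvent)

  ! Report a launch or execution error, if any.
  istat = cudaGetLastError()
  if (istat .ne. cudaSuccess) write(*,*) trim(cudaGetErrorString(istat))

  a = aDev
  write(*,*) a(1), a(NSIZE), ms, ' ms'

  istat = cudaEventDestroy(startEvent)
  istat = cudaEventDestroy(stopEvent)
end program eventTiming
```

The elapsed time is reported in milliseconds. For a kernel this small the number mostly reflects launch overhead, so it is only meaningful for heavier work.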

vibrantcascade · August 5, 2011, 2:56pm

Thanks for the help! I’m new to Fortran and was just trying to figure out how to use CUDA functions, as I couldn’t find a good simple example like this in the programming guide. (I’m mainly a C++/C# programmer.) I have the non-CUDA version working fine and figured I might just be able to declare them all as device functions and create a thousand threads in the call. This makes plenty of sense now, though.

(A nice simple example like this, with a few function calls, would be a good addition to chapter 5 of the CUDA programming guide. The matrix multiplication just doesn’t cover enough.)

Thanks Mat!

vibrantcascade · August 5, 2011, 3:45pm

OK, one quick question on this. I notice you only specified <<<grid>>> and not the block size in the call. Will it then automatically create as many threads per block as the card can handle, or will it implicitly recognize the block size of 256 even though it wasn’t passed in the call?

I believe the Fermi Tesla I’m using supports block sizes of 1024, so technically it could handle all 1024 values in one block, even though this code makes four blocks from what I can tell.

Update: never mind, it appears the forum deletes anything after the comma in a triple <<<>>> block like that, just like what I ran into.

MatColgrove · August 5, 2011, 4:57pm

OK, one quick question on this. I notice you only specified <<<grid>>> and not the block size in the call.

Artifact of the forum. It should be “grid,block” but for some reason when code gets posted the “block” gets removed. I’ll try to dive into phpBB’s scripts to see if I can correct it.

I believe the Fermi Tesla I’m using supports block sizes of 1024, so technically it could handle all 1024 values in one block, even though this code makes four blocks from what I can tell.

Correct, a Fermi can use up to 1024 threads per block. Note that we have a utility, ‘pgaccelinfo’, which you can use to query your device’s properties.
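The same limit can also be read from inside a program. The following is a minimal sketch that assumes a CUDA Fortran compiler and that the card of interest is device 0; queryDevice is just an illustrative program name.

```fortran
program queryDevice
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat

  ! Device number 0 is an assumption; systems with several GPUs may differ.
  istat = cudaGetDeviceProperties(prop, 0)
  if (istat .ne. cudaSuccess) then
    write(*,*) 'cudaGetDeviceProperties failed: ', &
               trim(cudaGetErrorString(istat))
    stop
  end if

  write(*,*) 'Device name:           ', trim(prop%name)
  write(*,*) 'Max threads per block: ', prop%maxThreadsPerBlock
  write(*,*) 'Multiprocessors:       ', prop%multiProcessorCount
end program queryDevice
```

Checking maxThreadsPerBlock before choosing BSIZE avoids hard-coding a block size that a different card cannot launch.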

NSIZE and BSIZE’s values are arbitrary. Feel free to make them bigger.
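When changing the sizes, the grid calculation (NSIZE+BSIZE-1)/BSIZE from the program above is an integer ceiling division, so the grid always has enough threads to cover every element. A small plain-Fortran sketch of the arithmetic follows (no GPU needed; the second size of 1000 is an assumed example, not from the original code).

```fortran
program gridSize
  implicit none
  integer, parameter :: BSIZE = 256
  integer :: nsize, nblocks

  ! Size used in the thread: divides evenly by the block size.
  nsize   = 1024
  nblocks = (nsize + BSIZE - 1) / BSIZE      ! integer ceiling division
  write(*,*) 'elements:', nsize, ' blocks:', nblocks, &
             ' idle threads:', nblocks*BSIZE - nsize

  ! Assumed example that does not divide evenly.
  nsize   = 1000
  nblocks = (nsize + BSIZE - 1) / BSIZE
  write(*,*) 'elements:', nsize, ' blocks:', nblocks, &
             ' idle threads:', nblocks*BSIZE - nsize
end program gridSize
```

Both cases give four blocks. With 1024 elements no threads are idle; with 1000 elements the last block has 24 threads whose index exceeds the array size, which is what the if (i .le. nsize) test in the kernel guards against.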

Mat