Performance CUDA fortran

Please consider the following simple code, which is a stripped-down ("stupid") sketch of the code I'm porting to CUDA Fortran.

module mod_test
 use cudafor
 implicit none
contains
 attributes(device) real*8 function calc2(i,j,k,l)
   implicit none
   integer, value :: i,j,k,l
   calc2 = (log(real(i))+log(real(j)))/exp(real(k))/exp(real(l))
 end function
 attributes(device) subroutine calc(counter,idx,Vettore)
   implicit none
   integer, value :: counter, idx
   real*8, dimension(counter),device :: Vettore
   real*8, dimension(32), shared :: vettore_shared
   real*8 :: var
   integer :: i,j,k,l, tid
   tid = threadidx%x
   do i=1,20
    do j=1,20
     do k=1,12
      do l=1,200   
         !var = calc2(i,j,k,l)  ! fast solution
         vettore_shared(tid) = calc2(i,j,k,l)  ! slow
         !Vettore(idx) = calc2(i,j,k,l)          ! slow
       end do
      end do
     end do
    end do
 end subroutine
 attributes(global) subroutine kernel_test(counter,Vettore)
   implicit none
   integer, value :: counter
   real*8, dimension(counter),device :: Vettore
   integer :: idx
   idx = (blockidx%x-1)*blockdim%x + threadidx%x
   call calc(counter,idx,Vettore)
 end subroutine
end module
program test
 use cudafor
 use mod_test
 implicit none
 integer :: nblocks, nthreads, counter
 integer :: c1, c2
 real*8, dimension(:), allocatable :: Vettore_host
 real*8, dimension(:), allocatable, device :: Vettore_dev
 counter = 9216
 nthreads = 32
 nblocks = counter/nthreads
 allocate(Vettore_host(counter))
 allocate(Vettore_dev(counter))
 call system_clock(count=c1)
 Vettore_dev = 0.d0
 Vettore_host = 0.d0
 call kernel_test<<<nblocks,nthreads>>>(counter,Vettore_dev)
 Vettore_host = Vettore_dev
 call system_clock(count=c2)
 write(*,*) 'time ', c2-c1  
end program

I have a routine that computes one element of the array Vettore_dev. If I update the device array Vettore_dev directly in the device subroutine calc, the code takes ~4 seconds to execute. The surprising thing for me is that if I use a shared array instead, i.e. vettore_shared, it still takes ~4 seconds.
Why? I expected the program to be much faster with a shared-memory array.
What's wrong?

Thank you in advance for every comment!

Hi goblinqueen,

The compiler is smart enough to notice that the result in 'var' is never used, so it optimizes the work away entirely. That is why that version is so much faster. Instead, after the last do loop, add "Vettore(idx) = var". Note that the "var" version will still be faster than the other two, since "var" can be held in a register.
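In other words, something like the sketch below. Note I've changed the assignment to an accumulation (`var = var + ...`), which is my own modification: if you only overwrite `var` on every iteration and write it once at the end, the compiler may still discard all but the last iteration. Accumulating keeps every iteration live while the scalar stays in a register.

```fortran
 attributes(device) subroutine calc(counter,idx,Vettore)
   implicit none
   integer, value :: counter, idx
   real*8, dimension(counter), device :: Vettore
   real*8 :: var
   integer :: i,j,k,l
   var = 0.d0
   do i=1,20
    do j=1,20
     do k=1,12
      do l=1,200
         var = var + calc2(i,j,k,l)  ! scalar held in a register
      end do
     end do
    end do
   end do
   Vettore(idx) = var   ! single global write keeps the result live
 end subroutine
```

This way there is only one global-memory store per thread instead of 48,000, and the compiler cannot eliminate the computation.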

Now, as to the difference between shared and global memory, I'm still in the process of learning this myself, so unfortunately I don't have any great insights. My experimentation shows little speed-up from shared memory unless you're able to reuse the data many times. Why? I'm not sure yet.
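The usual rule of thumb is that shared memory only pays off when a value loaded once from global memory is read many times by the block. A hypothetical sketch of that pattern (not from the code above; kernel and array names are made up):

```fortran
 ! Hypothetical: each thread loads one element into the shared tile,
 ! then the whole block reads all 32 elements repeatedly.
 attributes(global) subroutine reuse_example(n, a, b)
   implicit none
   integer, value :: n
   real*8, dimension(n), device :: a, b
   real*8, dimension(32), shared :: tile
   real*8 :: s
   integer :: idx, tid, j
   tid = threadIdx%x
   idx = (blockIdx%x-1)*blockDim%x + tid
   tile(tid) = a(idx)     ! one global load per thread
   call syncthreads()     ! make the tile visible to the whole block
   s = 0.d0
   do j = 1, 32           ! each shared element is read 32 times
      s = s + tile(j)
   end do
   b(idx) = s
 end subroutine
```

In your code each thread writes to `vettore_shared(tid)` 48,000 times but never reads it back, so there is no reuse for shared memory to exploit, and the cost of the repeated stores dominates either way.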

Any other users have any insights?

  • Mat

Dear Mat,

How do you know that "var" can be held in a register?

Many Thanks!

Sin sin

Hi Sin sin,

I don’t know for sure, just that it can. Though, given that it’s a scalar with a very high degree of re-use and that I see a speed-up, it most likely is being placed in a register. Can I prove it? No. Most likely? Yes.

  • Mat

Hi Mat,

Thanks a lot! I just wonder whether it is possible to control which variables are held in registers.

Sin sin

I just wonder whether it is possible to control which variables are held in registers.

No. Register allocation is done by the NVIDIA back-end tools, though I believe the tools will try to put as many local variables in registers as possible. The only thing you can control is the maximum number of registers your program uses (see "-Mcuda=maxregcount:n").
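For example, assuming the PGI compiler and a source file named test.cuf (the cap of 32 registers here is arbitrary):

```shell
# Cap each kernel at 32 registers per thread (PGI CUDA Fortran)
pgfortran -Mcuda=maxregcount:32 test.cuf -o test
```

Lowering the cap can raise occupancy, but registers that no longer fit spill to local memory, so it can also make a kernel slower; it's something to measure rather than assume.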

  • Mat

Thank you Mat for your suggestion.
I'll keep experimenting with different solutions…