Getting runtime from within kernel

Is there a way to time how long something takes to run within the kernel? I’ve looked at events, but those appear to work outside the kernel, whereas I need to check how long a device call takes within a global call. I’m looking for something like date_and_time or even just time().

The closest mechanism is clock() and clock64(), described here. These can occasionally be a bit confusing because the compiler may still re-order instructions around the calls. Also note that this measurement (like all other device code) happens per-thread. You can find many discussions about using these on this forum and other forums.

Oh dear, I was worried you would say that. I’m afraid I tried clock earlier today and got a compiler issue. Specifically I got the error
nvlink error : Undefined reference to ‘clock’ in ‘/tmp/pgfortranuEIm8j4mp4Qb.o’
pgnvd-Fatal-cnvlink completed with exit code 255

while trying to do

integer :: timeGPU
timeGPU = clock()

and my compiler call is
pgfortran -Mcuda -o CRAFT_GPU_Nt CRAFT_GPU_Nt-1.f90 -v

https://docs.nvidia.com/hpc-sdk/pgi-compilers/18.7/x86/cuda-fortran-prog-guide/index.htm#cfref-fort-mods-dev-mods

And if it were me, I would use only clock64.

Looks like clock64 compiled correctly. clock() still causes that same error.

Also, one more quick question: is clock64 supposed to output negative numbers? I’m not familiar with timing using clock cycles instead of time. Am I putting the output into the wrong type of variable?

No, it doesn’t output negative numbers. It outputs a 64-bit unsigned integer, so interpreting it as a signed integer is technically wrong. I’d be very surprised, though, if the actual output rolled over the 64-bit signed integer max value. So, I don’t really know what is happening in your case. If you are jamming it into a 32-bit integer, then that is probably the issue.
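
For anyone else new to cycle counts: clock64() returns ticks of the GPU clock, not a time, so a rough conversion divides by the clock rate. A default 4-byte integer tops out at 2147483647, which at clock rates on the order of 1 GHz is only about two seconds' worth of cycles, so a truncated counter can easily show up as a negative number. The host-side sketch below is only an illustration. It assumes the clockRate field (in kHz) of cudaDeviceProp is available in your toolkit and is close to the frequency the kernel actually ran at, which may not hold under boost or throttling. The tick value is a made-up placeholder, not a measurement.

```fortran
program ticks_to_seconds
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat
  integer(8) :: ticks      ! 8-byte integer, wide enough for clock64()
  real(8) :: seconds

  ! Query device 0 (assumption: the kernel ran on device 0).
  istat = cudaGetDeviceProperties(prop, 0)
  if (istat /= cudaSuccess) then
    print *, 'cudaGetDeviceProperties failed: ', trim(cudaGetErrorString(istat))
    stop
  end if

  ! Placeholder cycle count; in practice this would be a clock64() difference.
  ticks = 300000_8

  ! clockRate is reported in kHz, so multiply by 1000 to get cycles per second.
  seconds = real(ticks, 8) / (real(prop%clockRate, 8) * 1.0d3)

  print *, 'largest default integer: ', huge(0)
  print *, 'largest integer(8):      ', huge(0_8)
  print *, 'clock rate (kHz):        ', prop%clockRate
  print *, 'approx. elapsed seconds: ', seconds
end program ticks_to_seconds
```

Treat the result as an estimate only; differences between two clock64() readings taken in the same thread are the meaningful quantity.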

I’ve tried changing the variable from integer::timeGPU to integer(8)::timeGPU, but that appears to cause the error copyout Memcpy (host=0x0x2207680, dev=0x0x2300b20a00, size=8) FAILED: 77(an illegal memory access was encountered), and I’m not sure why, since that hasn’t happened before. It’s also causing the code to compile much, much faster, which makes me think something may be wrong with the way I’m declaring variables. Is there a different way to specify integer(8)? I’ve been using real*8 successfully until now, so I assumed integer(8) would work like normal.

Update: I removed the print statement and it looks like everything is back to normal. I guess printing the integer(8) is causing problems, which I wouldn’t have guessed since printing real*8 works fine, but I digress. Is there an easy way to output this value from within the kernel, or is printing itself the problem? Or is the fact that it cannot print abnormal?

I believe you can print from a CUDA Fortran kernel.

You can print from CUDA Fortran kernels just fine; I do it all the time. But for some reason, trying to print the integer(8) I put timeGPU into causes a memory issue.

I don’t seem to have any trouble with it:

$ cat t11a.cuf
module mmk
   USE cudafor
   !
   ! Definition of symbols for real types (RP)
   !
   IMPLICIT NONE
   !
   INTEGER, PARAMETER :: SP = SELECTED_REAL_KIND(6,   37)     ! REAL32
   INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(15, 307)     ! REAL64
   INTEGER, PARAMETER :: SIZE_ = 1024
   !

   Contains

        attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)
          USE cudadevice

          integer :: tx, ty, k
          integer(8) :: dt
          REAL(DP) :: accum
          REAL(DP), dimension(:) :: d_mat, d_matT, d_matSym

          tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
          ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y

          if (tx <= SIZE_ .and. ty <=SIZE_) then
            accum = 0.0
            dt = clock64()
            do k=1, SIZE_
              accum = accum + d_mat((ty-1)*SIZE_+k) * d_matT((k-1)*SIZE_+tx)
            end do
            dt = clock64()-dt
            d_matSym((ty-1)*SIZE_+tx) = accum
            if (tx == 1 .and. ty == 1) then
                print *,dt
            end if
          end if
        end subroutine matrixMultiply


end module mmk

PROGRAM Test
   !
   ! This is the main program for Test
   !
   USE cudafor
   USE mmk

   !
   IMPLICIT NONE
   !
   REAL(DP), ALLOCATABLE, DEVICE, DIMENSION(:)     :: d_mat, d_matT, d_matSym
   !
   INTEGER    :: err, i1, i2
   type(dim3) :: grid_dim, blk_dim
   !
   ! Allocate storage for the arrays
   !
   Allocate(d_mat(SIZE_*SIZE_),d_matT(SIZE_*SIZE_),d_matSym(SIZE_*SIZE_))
   !
   ! invoke the kernel
   !

   !Call
   grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
   blk_dim = dim3(32, 32, 1)
   call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
   call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
   err = cudaDeviceSynchronize()


   !
   ! Free storage for the arrays
   !
   Deallocate(d_mat,d_matT,d_matSym)
   !
END PROGRAM Test
$ nvfortran t11a.cuf -o t11a
$ ./t11a
                   301727
                   290399
$
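
As an alternative to printing from inside the kernel, each thread can store its cycle count in a device array that the host copies back and prints afterwards. The following is a minimal sketch of that idea, not a fix for the error discussed in this thread. The names timing_mod, timed_kernel, ticks_d and ticks_h are hypothetical, and the loop body is just filler work to give clock64() something to measure.

```fortran
module timing_mod
  use cudafor
  implicit none
contains
  ! Each thread times a small loop and stores its own cycle count.
  attributes(global) subroutine timed_kernel(x, ticks, n)
    use cudadevice
    integer, value :: n
    real(8) :: x(n)
    integer(8) :: ticks(n)     ! one 8-byte slot per thread
    integer :: i, k
    integer(8) :: t0
    real(8) :: acc

    i = threadIdx%x + (blockIdx%x - 1) * blockDim%x
    if (i <= n) then
      acc = x(i)
      t0 = clock64()
      do k = 1, 100
        acc = acc * 1.0000001d0 + 1.0d0
      end do
      ticks(i) = clock64() - t0
      x(i) = acc               ! keep the result live so the loop is not optimized away
    end if
  end subroutine timed_kernel
end module timing_mod

program time_in_kernel
  use cudafor
  use timing_mod
  implicit none
  integer, parameter :: n = 256
  real(8), device :: x_d(n)
  integer(8), device :: ticks_d(n)
  integer(8) :: ticks_h(n)
  integer :: istat

  x_d = 1.0d0
  ticks_d = 0_8
  call timed_kernel<<<1, n>>>(x_d, ticks_d, n)
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) print *, 'kernel error: ', trim(cudaGetErrorString(istat))

  ticks_h = ticks_d            ! device-to-host copy by assignment
  print *, 'thread 1 cycles: ', ticks_h(1)
  print *, 'max cycles:      ', maxval(ticks_h)
end program time_in_kernel
```

Because the measurement is per-thread, the host sees one value per thread and can decide whether the first, the maximum, or an average is the number of interest.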

How strange, I guess I’ll do some more investigating.

It’s the strangest thing. I’ve tested with your version and mine and both of them seem to cause the same

0: DEV_MKDESC: allocate FAILED:77(an illegal memory access was encountered)

error, but ONLY when printing. If I remove the print, it’s fine. Is it possible I’m missing a package or some setting somewhere? Could my compile options be causing the problem? I’m compiling with: pgfortran -Mcuda -o CRAFT_GPU_Nt timePrintTest.cuf -v

So your code/file timePrintTest.cuf is an exact duplicate of the t11a.cuf that I posted?

Mine doesn’t have the
$cat

or the
&nvfortran
&./t11a
but aside from that they are identical. I did try commenting out the print to see if that would make it run and it has.

I also tried compiling it as a .f90, and that didn’t work either.

At this point, I only have very general suggestions. No idea what is wrong exactly.

  1. Verify the general CUDA Fortran capability. Do known good codes such as vector add work correctly?
  2. Run your code with cuda-memcheck; it may give some clues.
  3. Ask your question on the forum for PGI legacy compilers (which is what you are using) here.
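
For the first suggestion, a known-good test can be as small as the vector add sketched below, with the runtime status checked after the launch and after synchronization. This is an illustrative sketch rather than an official sample, and the names vadd_mod, vadd and vadd_check are hypothetical.

```fortran
module vadd_mod
  use cudafor
  implicit none
contains
  ! c = a + b, one element per thread.
  attributes(global) subroutine vadd(a, b, c, n)
    integer, value :: n
    real :: a(n), b(n), c(n)
    integer :: i
    i = threadIdx%x + (blockIdx%x - 1) * blockDim%x
    if (i <= n) c(i) = a(i) + b(i)
  end subroutine vadd
end module vadd_mod

program vadd_check
  use cudafor
  use vadd_mod
  implicit none
  integer, parameter :: n = 1024
  real, device :: a_d(n), b_d(n), c_d(n)
  real :: c_h(n)
  integer :: istat

  a_d = 1.0
  b_d = 2.0
  call vadd<<<(n + 255) / 256, 256>>>(a_d, b_d, c_d, n)

  ! Launch-time errors (bad configuration, missing device code, ...).
  istat = cudaGetLastError()
  if (istat /= cudaSuccess) print *, 'launch error: ', trim(cudaGetErrorString(istat))

  ! Errors raised while the kernel was running.
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) print *, 'kernel error: ', trim(cudaGetErrorString(istat))

  c_h = c_d                    ! device-to-host copy by assignment
  if (all(c_h == 3.0)) then
    print *, 'vector add OK'
  else
    print *, 'vector add FAILED'
  end if
end program vadd_check
```

If this builds and passes with the same compiler invocation, the basic toolchain and driver are probably fine and the problem is more likely specific to device-side printing.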

Will do, I appreciate your time. I’ll let you know what ends up being the problem when/if it’s solved.