Is there a way to time how long something takes to run within a kernel? I’ve looked at events, but those appear to operate outside the kernel, whereas I need to check how long a device call takes within a global subroutine. Something like date_and_time or even just time().
The closest mechanism is
clock64(), described here. It can occasionally be a bit confusing to use because the compiler may still reorder instructions around it. Also note that this measurement (like all other device code) happens per-thread. You can find many discussions about using it on this and other forums.
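To make the idea concrete, here is a minimal sketch of per-thread timing with clock64() inside a CUDA Fortran kernel. The kernel and variable names are illustrative, not from your code; the point is that the start and end counters, and their difference, must be 64-bit integers:

```fortran
attributes(global) subroutine timedKernel(d_out, d_cycles)
  use cudadevice
  real(8), dimension(:) :: d_out
  integer(8), dimension(:) :: d_cycles   ! clock64() returns a 64-bit count
  integer :: i, k
  integer(8) :: t0, t1

  i = threadIdx%x + (blockIdx%x - 1) * blockDim%x
  t0 = clock64()                  ! cycle counter before the work
  do k = 1, 1000
    d_out(i) = d_out(i) + 1.0d0
  end do
  t1 = clock64()                  ! cycle counter after the work
  d_cycles(i) = t1 - t0           ! elapsed cycles, per thread
end subroutine timedKernel
```

Note the result is a cycle count, not a time; to convert to seconds you would divide by the GPU clock rate, and because the compiler may move instructions across the clock64() calls, the measured region should be treated as approximate.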
Oh dear, I was worried you would say that. I’m afraid I tried clock earlier today and got a compiler error. Specifically:
nvlink error : Undefined reference to ‘clock’ in ‘/tmp/pgfortranuEIm8j4mp4Qb.o’
pgnvd-Fatal-cnvlink completed with exit code 255
while trying to do
integer :: timeGPU
timeGPU = clock()
and my compiler call is
pgfortran -Mcuda -o CRAFT_GPU_Nt CRAFT_GPU_Nt-1.f90 -v
If it were me, I would use clock64() only.
Looks like clock64 compiled correctly. clock() still causes that same error.
Also, one more quick question: is clock64 supposed to output negative numbers? I’m not familiar with timing in clock cycles rather than seconds. Am I putting the output into the wrong type of variable?
No, it doesn’t output negative numbers. It returns a 64-bit unsigned integer, so interpreting it as a signed integer is technically wrong, though I’d be very surprised if the actual output rolled over the 64-bit signed integer maximum. So I don’t really know what is happening in your case. If you are jamming it into a 32-bit integer, then that is probably the issue.
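Applied to the snippet you posted earlier, the fix would look something like this (keeping your variable name timeGPU; the tStart variable and the timed region are illustrative):

```fortran
! store the result of clock64() in a 64-bit integer, not a default integer
integer(8) :: timeGPU, tStart

tStart  = clock64()
! ... device work being timed ...
timeGPU = clock64() - tStart   ! elapsed clock cycles for this thread
```

With a default (32-bit) integer, the high bits of the cycle count are truncated, which is one way to end up with apparently negative values.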
I’ve tried changing the variable from integer :: timeGPU to integer(8) :: timeGPU, but that appears to cause
copyout Memcpy (host=0x0x2207680, dev=0x0x2300b20a00, size=8) FAILED: 77(an illegal memory access was encountered)
and I’m not sure why, since that hasn’t happened before. It’s also compiling much, much faster, which makes me think something may be wrong with the way I’m declaring variables. Is there a different way to specify integer(8)? I’ve been using real*8 successfully until now, so I assumed integer(8) would work like normal.
Update: I removed the print statement and it looks like everything is back to normal. I guess printing the integer(8) is causing problems, which I wouldn’t have guessed since printing a real*8 works fine, but I digress. Is there an easy way to output this value from within the kernel if printing causes issues, or is the fact that it cannot print it itself a problem?
I believe you can print from a CUDA Fortran kernel.
You can print from CUDA Fortran kernels just fine, I do it all the time, but for some reason trying to print the integer(8) I put the timeGPU into causes a memory issue.
I don’t seem to have any trouble with it:
$ cat t11a.cuf
module mmk
  USE cudafor
  !
  ! Definition of symbols for real types (RP)
  !
  IMPLICIT NONE
  !
  INTEGER, PARAMETER :: SP = SELECTED_REAL_KIND(6, 37)   ! REAL32
  INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(15, 307) ! REAL64
  INTEGER, PARAMETER :: SIZE_ = 1024
  !
Contains

  attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)
    USE cudadevice
    integer :: tx, ty, k
    integer(8) :: dt
    REAL(DP) :: accum
    REAL(DP), dimension(:) :: d_mat, d_matT, d_matSym

    tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
    ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y
    if (tx <= SIZE_ .and. ty <= SIZE_) then
      accum = 0.0
      dt = clock64()
      do k = 1, SIZE_
        accum = accum + d_mat((ty-1)*SIZE_+k) * d_matT((k-1)*SIZE_+tx)
      end do
      dt = clock64() - dt
      d_matSym((ty-1)*SIZE_+tx) = accum
      if (tx == 1 .and. ty == 1) then
        print *, dt
      end if
    end if
  end subroutine matrixMultiply

end module mmk

PROGRAM Test
  !
  ! This is the main program for Test
  !
  USE cudafor
  USE mmk
  !
  IMPLICIT NONE
  !
  REAL(DP), ALLOCATABLE, DEVICE, DIMENSION(:) :: d_mat, d_matT, d_matSym
  !
  INTEGER :: err, i1, i2
  type(dim3) :: grid_dim, blk_dim
  !
  ! Allocate storage for the arrays
  !
  Allocate(d_mat(SIZE_*SIZE_), d_matT(SIZE_*SIZE_), d_matSym(SIZE_*SIZE_))
  !
  ! invoke the kernel
  !
  grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
  blk_dim  = dim3(32, 32, 1)
  call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
  call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
  err = cudaDeviceSynchronize()
  !
  ! Free storage for the arrays
  !
  Deallocate(d_mat, d_matT, d_matSym)
  !
END PROGRAM Test
$ nvfortran t11a.cuf -o t11a
$ ./t11a
                   301727
                   290399
$
How strange, I guess I’ll do some more investigating.
It’s the strangest thing. I’ve tested with both your version and mine, and both of them seem to cause the same
0: DEV_MKDESC: allocate FAILED:77(an illegal memory access was encountered)
error, but ONLY when printing. If I remove the print, it runs fine. Is it possible I’m missing a package or some setting somewhere, or that my compile options are causing the problem? I’m compiling with
pgfortran -Mcuda -o CRAFT_GPU_Nt timePrintTest.cuf -v
So your file
timePrintTest.cuf is an exact duplicate of the
t11a.cuf that I posted?
Mine doesn’t have the
but aside from that they are identical. I did try commenting out the print to see if that would make it run, and it did.
I also tried compiling it as a .f90, and that didn’t work either.
At this point I only have very general suggestions; I have no idea what is wrong exactly.
- verify general CUDA Fortran capability: do known good codes such as vector add work correctly?
- run your code with cuda-memcheck, it may give some clues.
- ask your question on the forum for PGI legacy compilers (which is what you are using), here.
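The second suggestion would look something like this on the command line (binary name taken from your compile command above; cuda-memcheck ships with the CUDA toolkit, and newer toolkits replace it with compute-sanitizer):

```shell
# run the failing binary under cuda-memcheck to localize the illegal access
cuda-memcheck ./CRAFT_GPU_Nt

# on newer CUDA toolkits the equivalent tool is compute-sanitizer:
# compute-sanitizer ./CRAFT_GPU_Nt
```

It reports the kind of access fault and, when the code was built with device debug information, the kernel and line where it occurred.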
Will do, I appreciate your time. I’ll let you know what ends up being the problem when/if it’s solved.