cudaFuncSetCacheConfig and CUDA 5.0

All,

I seem to be encountering an issue with CUDA 5 that I didn’t see with CUDA 4.2, and it appears to involve the use of cudaFuncSetCacheConfig.

Namely, my CUDA Fortran code runs just swimmingly if I compile with -Mcuda=4.2,cc20 (which is my default). However, I just decided to try it with -Mcuda=5.0,cc20 and, boom:

[janus:26230] *** Process received signal ***
[janus:26230] Signal: Segmentation fault (11)
[janus:26230] Signal code: Address not mapped (1)
[janus:26230] Failing at address: (nil)
[janus:26230] [ 0] /lib64/libpthread.so.0() [0x3d4440f500]
[janus:26230] *** End of error message ***

This being CUDA Fortran, I “debugged” (aka added some print statements) and was able to track down the crash to a cudaFuncSetCacheConfig call made just before the first GPU kernel call. And, indeed, commenting out this call allowed the program to proceed…until the next cudaFuncSetCacheConfig call, where it crashed.

I decided to whip up a tester (which is just the …/13.4/etc/samples/cudafor/sgemm.cuf code with a cudaFuncSetCacheConfig call added), where the important bits are:

  call sgemm_cpu(A, B, gold, m, N, k, alpha, beta)

  ! timing experiment
  time = 0.0

  istat = cudaFuncSetCacheConfig('saxpy_sgemm_sgemmnn_16x16',cudaFuncCachePreferL1)

  istat = cudaGetLastError()
  if (istat /= 0) then
     write (*,*) "Error code from cache set call: ", istat
     write (*,*) "Kernel call failed: ", cudaGetErrorString(istat)
  end if

  istat = cudaEventRecord(start, 0)
  do j = 1, NREPS
    call sgemmNN_16x16<<<blocks, threads>>>(dA, dB, dC, m, N, k, alpha, beta)
  end do
  istat = cudaEventRecord(stop, 0)
  istat = cudaThreadSynchronize()

I then ran a couple of experiments:

(387) $ pgfortran -V 

pgfortran 13.4-0 64-bit target on x86-64 Linux -tp nehalem 
Copyright 1989-2000, The Portland Group, Inc.  All Rights Reserved.
Copyright 2000-2013, STMicroelectronics, Inc.  All Rights Reserved.
(388) $ pgfortran -Mcuda=4.2,cc20,ptxinfo -Minfo sgemm-funccache.cuf -o cuda42.exe 
ptxas info    : Compiling entry function 'saxpy_sgemm_sgemmnn_16x16_' for 'sm_20'
ptxas info    : Function properties for saxpy_sgemm_sgemmnn_16x16_
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 20 registers, 1088+0 bytes smem, 76 bytes cmem[0]
(389) $ pgfortran -Mcuda=5.0,cc20,ptxinfo -Minfo sgemm-funccache.cuf -o cuda50.exe
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'saxpy_sgemm_sgemmnn_16x16_' for 'sm_20'
ptxas info    : Function properties for saxpy_sgemm_sgemmnn_16x16_
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 19 registers, 1088 bytes smem, 76 bytes cmem[0]
(390) $ ./cuda42.exe

Device:Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory.

 Test passed!
256x256 * 256x256:	   0.414 ms	  81.004 GFlops/s
(391) $ ./cuda50.exe

Device:Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory.

 Error code from cache set call:             8
 Kernel call failed: 
 invalid device function                                                                                                         
 Test passed!
256x256 * 256x256:	   1.578 ms	  21.267 GFlops/s

Thus, it looks like the CUDA 5 version behaves differently. In the case of my tester it doesn’t crash out with a segfault, but it still differs from CUDA 4.2 in that it triggers an error (and is slower…for some reason).

So, my question is: is this expected? Was cudaFuncSetCacheConfig deprecated? Or, perhaps, is my driver too old? I’m using 304.60.

(And, I suppose, I’m not sure why the CUDA 5 version is slower…even without the cudaFuncSetCacheConfig call; I tested that. Ideas?)

Thanks,
Matt

NVIDIA changed the cudaFuncSetCacheConfig API between CUDA 4.2 and CUDA 5.0.

In CUDA 5.0, try passing just the subroutine name (the bare symbol):

istat = cudaFuncSetCacheConfig(sgemmnn_16x16, cudaFuncCachePreferL1)
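
In context, a minimal sketch of both forms (assuming, as in the sgemm sample, that the module containing the kernel is already use-associated in the main program, so the symbol sgemmNN_16x16 is in scope):

  ! CUDA 4.2 style: identify the kernel by its mangled name as a character string
  ! istat = cudaFuncSetCacheConfig('saxpy_sgemm_sgemmnn_16x16', cudaFuncCachePreferL1)

  ! CUDA 5.0 style: pass the kernel symbol itself
  istat = cudaFuncSetCacheConfig(sgemmNN_16x16, cudaFuncCachePreferL1)
  if (istat /= cudaSuccess) then
     write (*,*) "cudaFuncSetCacheConfig failed: ", cudaGetErrorString(istat)
  end if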

Sorry, we try to shield users from things changing out from underneath them, but in this case there was nothing we could do.

As for performance, you’ll be a lot happier if you turn on optimization.

brentl@sb-leback:~/simple/cuda> pgf90 -Mcuda=4.2 sgemm.cuf
brentl@sb-leback:~/simple/cuda> ./a.out

Test PASSED!
256x256 * 256x256: 0.841 ms 39.884 GFlops/s
brentl@sb-leback:~/simple/cuda> pgf90 -Mcuda=5.0 sgemm.cuf
brentl@sb-leback:~/simple/cuda> ./a.out

Test PASSED!
256x256 * 256x256: 1.873 ms 17.913 GFlops/s
brentl@sb-leback:~/simple/cuda> pgf90 -Mcuda=4.2 -fast sgemm.cuf
brentl@sb-leback:~/simple/cuda> ./a.out

Test PASSED!
256x256 * 256x256: 0.171 ms 196.531 GFlops/s
brentl@sb-leback:~/simple/cuda> pgf90 -Mcuda=5.0 -fast sgemm.cuf
brentl@sb-leback:~/simple/cuda> ./a.out

Test PASSED!
256x256 * 256x256: 0.176 ms 191.071 GFlops/s

Brent,

It still didn’t seem to help me with the tester on my box:

(551) $ grep FuncSet sgemm-funccache-for50.cuf
  istat = cudaFuncSetCacheConfig('sgemmnn_16x16',cudaFuncCachePreferL1)
(552) $ pgfortran -fast -Mcuda=5.0,cc20,ptxinfo sgemm-funccache-for50.cuf -o cuda50.exe
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'saxpy_sgemm_sgemmnn_16x16_' for 'sm_20'
ptxas info    : Function properties for saxpy_sgemm_sgemmnn_16x16_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 51 registers, 1088 bytes smem, 76 bytes cmem[0]
(553) $ ./cuda50.exe

Device:Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory.

 Error code from cache set call:             8
 Kernel call failed: 
 invalid device function                                                                                                         
 Test passed!
256x256 * 256x256:	   0.156 ms	 215.302 GFlops/s

At least, the error check still returns a non-zero istat; for reference, here is my diff against the stock sample:

(559) $ diff sgemm-funccache-for50.cuf /opt/pgi/linux86-64/13.4/etc/samples/cudafor/sgemm.cuf 
148,156d147
< 
<   istat = cudaFuncSetCacheConfig('sgemmnn_16x16',cudaFuncCachePreferL1)
< 
<   istat = cudaGetLastError()
<   if (istat /= 0) then
<      write (*,*) "Error code from cache set call: ", istat
<      write (*,*) "Kernel call failed: ", cudaGetErrorString(istat)
<   end if
<

Hey Matt, just some general input on debugging with CUDA Fortran: I’ve lately had good experience starting with cuda-memcheck on programs that fail with a CUDA error, before going into lengthy debugging with write statements. It may not help here, but it can easily save you half an hour here and there in the future. The reason is that uninitialized memory in CUDA often seems to corrupt your state in such a way that strange errors show up in places where you wouldn’t expect them - cuda-memcheck will find those uninitialized places for you.
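
For reference, it’s as simple as running the failing executable under the tool:

$ cuda-memcheck ./cuda50.exe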

Take the name out of quotes. The new API takes a function pointer, not a character string.

Aaaahhh, that’s it! I missed that the quotes were gone in your example.

Hmm…I’m thinking I might just see what happens if I remove all of these calls, rather than trying to maintain them for various versions of CUDA. Still, is there a macro I can test for that, like _CUDA=50 or something?
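
What I have in mind is something like the sketch below - though I’m only guessing at the macro name (CUDA_VERSION, with a value like 5000 for CUDA 5.0, is an assumption here; whatever the compiler actually predefines would go in its place), and it needs the preprocessor (an uppercase .CUF extension or -Mpreprocess):

#if defined(CUDA_VERSION) && (CUDA_VERSION >= 5000)
  ! CUDA 5.0 and newer: pass the kernel symbol
  istat = cudaFuncSetCacheConfig(sgemmNN_16x16, cudaFuncCachePreferL1)
#else
  ! CUDA 4.2 and earlier: pass the mangled kernel name as a string
  istat = cudaFuncSetCacheConfig('saxpy_sgemm_sgemmnn_16x16', cudaFuncCachePreferL1)
#endif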