All,
I’m running into an issue with CUDA 5 that I didn’t have with CUDA 4.2, and it seems to involve the use of cudaFuncSetCacheConfig.
Namely, my CUDA Fortran code runs just swimmingly if I compile with -Mcuda=4.2,cc20 (my default). However, I decided to try it with -Mcuda=5.0,cc20 and, boom:
[janus:26230] *** Process received signal ***
[janus:26230] Signal: Segmentation fault (11)
[janus:26230] Signal code: Address not mapped (1)
[janus:26230] Failing at address: (nil)
[janus:26230] [ 0] /lib64/libpthread.so.0() [0x3d4440f500]
[janus:26230] *** End of error message ***
This being CUDA Fortran, I “debugged” (i.e., added some print statements) and was able to track the crash down to a cudaFuncSetCacheConfig call made just before the first GPU kernel launch. And, indeed, commenting out that call allowed the program to proceed…until the next cudaFuncSetCacheConfig call, where it crashed again.
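For reference, the print-statement bracketing looked roughly like this (a sketch; istat is an integer status variable, and the kernel name shown is the one from the tester below):

write (*,*) "before cudaFuncSetCacheConfig"   ! last line that printed before the segfault
istat = cudaFuncSetCacheConfig('saxpy_sgemm_sgemmnn_16x16', cudaFuncCachePreferL1)
write (*,*) "after cudaFuncSetCacheConfig, istat = ", istat   ! never reached under -Mcuda=5.0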
I decided to whip up a tester (which is just the …/13.4/etc/samples/cudafor/sgemm.cuf code with a cudaFuncSetCacheConfig call added), where the important bits are:
call sgemm_cpu(A, B, gold, m, N, k, alpha, beta)

! timing experiment
time = 0.0
istat = cudaFuncSetCacheConfig('saxpy_sgemm_sgemmnn_16x16', cudaFuncCachePreferL1)
istat = cudaGetLastError()
if (istat /= 0) then
   write (*,*) "Error code from cache set call: ", istat
   write (*,*) "Kernel call failed: ", cudaGetErrorString(istat)
end if

istat = cudaEventRecord(start, 0)
do j = 1, NREPS
   call sgemmNN_16x16<<<blocks, threads>>>(dA, dB, dC, m, N, k, alpha, beta)
end do
istat = cudaEventRecord(stop, 0)
istat = cudaThreadSynchronize()
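(One aside on that snippet: cudaFuncSetCacheConfig returns a status directly, and the cudaGetLastError on the next line overwrites it in istat; checking the call’s own return value would look like this sketch. In this case both report the same thing, since the cache-config call is what set the last error.)

istat = cudaFuncSetCacheConfig('saxpy_sgemm_sgemmnn_16x16', cudaFuncCachePreferL1)
if (istat /= cudaSuccess) then
   ! report the failure straight from the call's own return code
   write (*,*) "cudaFuncSetCacheConfig failed: ", cudaGetErrorString(istat)
end if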
I then ran a couple of experiments:
(387) $ pgfortran -V
pgfortran 13.4-0 64-bit target on x86-64 Linux -tp nehalem
Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
Copyright 2000-2013, STMicroelectronics, Inc. All Rights Reserved.
(388) $ pgfortran -Mcuda=4.2,cc20,ptxinfo -Minfo sgemm-funccache.cuf -o cuda42.exe
ptxas info : Compiling entry function 'saxpy_sgemm_sgemmnn_16x16_' for 'sm_20'
ptxas info : Function properties for saxpy_sgemm_sgemmnn_16x16_
64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 1088+0 bytes smem, 76 bytes cmem[0]
(389) $ pgfortran -Mcuda=5.0,cc20,ptxinfo -Minfo sgemm-funccache.cuf -o cuda50.exe
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'saxpy_sgemm_sgemmnn_16x16_' for 'sm_20'
ptxas info : Function properties for saxpy_sgemm_sgemmnn_16x16_
64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 19 registers, 1088 bytes smem, 76 bytes cmem[0]
(390) $ ./cuda42.exe
Device:Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory.
Test passed!
256x256 * 256x256: 0.414 ms 81.004 GFlops/s
(391) $ ./cuda50.exe
Device:Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory.
Error code from cache set call: 8
Kernel call failed:
invalid device function
Test passed!
256x256 * 256x256: 1.578 ms 21.267 GFlops/s
Thus, it looks like the CUDA 5 version behaves differently. In the case of my tester it doesn’t crash with a segfault, but it still differs from CUDA 4.2 in that it triggers an error (code 8, invalid device function) and is slower…for some reason.
So, my question is: is this expected? Was cudaFuncSetCacheConfig deprecated? Or, perhaps, is my driver too old? I’m using 304.60.
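If it helps with diagnosing the driver question, here’s a small sketch of a version check using the version queries from the cudafor module (the values come back encoded as, e.g., 5000 for CUDA 5.0):

program vercheck
   use cudafor
   implicit none
   integer :: istat, idriver, iruntime
   istat = cudaDriverGetVersion(idriver)    ! highest CUDA version the installed driver supports
   istat = cudaRuntimeGetVersion(iruntime)  ! CUDA runtime version the executable is using
   write (*,*) "driver CUDA version:  ", idriver
   write (*,*) "runtime CUDA version: ", iruntime
end program vercheck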
(And, I suppose, I’m not sure why the CUDA 5 version is slower…even without the cudaFuncSetCacheConfig call; I tested that. Ideas?)
Thanks,
Matt