I seem to be encountering an issue with CUDA 5 that I didn't see with CUDA 4.2, and it appears to involve the use of cudaFuncSetCacheConfig.
Namely, my CUDA Fortran code runs just swimmingly if I compile -Mcuda=4.2,cc20 (which is my default). However, I just decided to try it out with -Mcuda=5.0,cc20 and, boom:
[janus:26230] *** Process received signal ***
[janus:26230] Signal: Segmentation fault (11)
[janus:26230] Signal code: Address not mapped (1)
[janus:26230] Failing at address: (nil)
[janus:26230] [ 0] /lib64/libpthread.so.0() [0x3d4440f500]
[janus:26230] *** End of error message ***
This being CUDA Fortran, I “debugged” (aka added some print statements) and was able to track down the crash to a cudaFuncSetCacheConfig call made just before the first GPU kernel call. And, indeed, commenting out this call allowed the program to proceed…until the next cudaFuncSetCacheConfig call, where it crashed.
I decided to whip up a tester (which is just the …/13.4/etc/samples/cudafor/sgemm.cuf code with a FuncSet call) where the important bits are:
call sgemm_cpu(A, B, gold, m, N, k, alpha, beta)

! timing experiment
time = 0.0
istat = cudaFuncSetCacheConfig('saxpy_sgemm_sgemmnn_16x16', cudaFuncCachePreferL1)
istat = cudaGetLastError()
if (istat /= 0) then
   write (*,*) "Error code from cache set call: ", istat
   write (*,*) "Kernel call failed: ", cudaGetErrorString(istat)
end if
istat = cudaEventRecord(start, 0)
do j = 1, NREPS
   call sgemmNN_16x16<<<blocks, threads>>>(dA, dB, dC, m, N, k, alpha, beta)
end do
istat = cudaEventRecord(stop, 0)
istat = cudaThreadSynchronize()
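(As an aside, the error could also be caught from the return value of the cache-config call itself, which avoids the slightly misleading "Kernel call failed" message for an error that actually comes from cudaFuncSetCacheConfig. A small sketch of that check, assuming the same mangled kernel-name string as above:

```fortran
! Check the cache-config call's own return code rather than cudaGetLastError()
istat = cudaFuncSetCacheConfig('saxpy_sgemm_sgemmnn_16x16', cudaFuncCachePreferL1)
if (istat /= cudaSuccess) then
   write (*,*) "cudaFuncSetCacheConfig failed: ", cudaGetErrorString(istat)
end if
```

Either way, the error it reports under CUDA 5 is the same.)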
I then did a couple experiments:
(387) $ pgfortran -V
pgfortran 13.4-0 64-bit target on x86-64 Linux -tp nehalem
Copyright 1989-2000, The Portland Group, Inc.  All Rights Reserved.
Copyright 2000-2013, STMicroelectronics, Inc.  All Rights Reserved.
(388) $ pgfortran -Mcuda=4.2,cc20,ptxinfo -Minfo sgemm-funccache.cuf -o cuda42.exe
ptxas info    : Compiling entry function 'saxpy_sgemm_sgemmnn_16x16_' for 'sm_20'
ptxas info    : Function properties for saxpy_sgemm_sgemmnn_16x16_
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 20 registers, 1088+0 bytes smem, 76 bytes cmem
(389) $ pgfortran -Mcuda=5.0,cc20,ptxinfo -Minfo sgemm-funccache.cuf -o cuda50.exe
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'saxpy_sgemm_sgemmnn_16x16_' for 'sm_20'
ptxas info    : Function properties for saxpy_sgemm_sgemmnn_16x16_
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 19 registers, 1088 bytes smem, 76 bytes cmem
(390) $ ./cuda42.exe
Device: Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory.
Test passed!
256x256 * 256x256:  0.414 ms  81.004 GFlops/s
(391) $ ./cuda50.exe
Device: Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory.
Error code from cache set call:  8
Kernel call failed: invalid device function
Test passed!
256x256 * 256x256:  1.578 ms  21.267 GFlops/s
So it looks like the CUDA 5 build behaves differently. With my tester it doesn't crash with a segfault, but it still differs from CUDA 4.2 in that the call triggers an "invalid device function" error (and the run is slower…for some reason).
So, my question is: is this expected? Was cudaFuncSetCacheConfig deprecated? Or, perhaps, is my driver too old? I'm using 304.60.
(And, I suppose, I’m not sure why the CUDA 5 version is slower…even without the cudaFuncSetCacheConfig call; I tested that. Ideas?)