I am trying to do matrix multiplication on my NVIDIA GTX 260, which is the primary display card under Windows XP. I have read that kernels lasting longer than 5 seconds are killed by the Windows watchdog timer, so I have made sure each of my CUBLAS calls is short (< 5 ms). A single matrix multiplication with cublasSgemm works fine. However, when I put the multiplication in a loop, the cudaThreadSynchronize() following the call to cublasSgemm returns an “unspecified launch failure” after a seemingly arbitrary number of iterations (usually around 150,000, but it varies greatly). I have read that this error is often the GPU equivalent of a segmentation fault.

To rule out my own code, I created a test case based on the CUDA SDK matrix multiplication sample (the non-driver version). I just put a loop around the kernel call, followed by a cudaThreadSynchronize(). Once again, with both matrix sizes increased to 1024x1024, cudaThreadSynchronize() returns an “unspecified launch failure” after about 150,000 iterations.
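The test case described above can be sketched roughly as follows. This is not the SDK sample verbatim: `matMulNaive` is a hypothetical stand-in for the sample's kernel, and the 16x16 block size, the zero-filled inputs, and the 200,000-iteration cap are assumptions chosen only for illustration. The point is the structure: launch, synchronize, check both error codes, and report the iteration at which the failure appears.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the SDK matrixMul kernel: naive C = A * B, n x n, row-major.
__global__ void matMulNaive(const float* A, const float* B, float* C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;   // guard against out-of-bounds threads

    float sum = 0.0f;
    for (int k = 0; k < n; ++k)
        sum += A[row * n + k] * B[k * n + col];
    C[row * n + col] = sum;
}

int main()
{
    const int n = 1024;                              // matrix size from the post
    const size_t bytes = (size_t)n * n * sizeof(float);
    const long maxIters = 200000;                    // assumed cap, above the ~150,000 failure point

    float *dA = 0, *dB = 0, *dC = 0;
    if (cudaMalloc((void**)&dA, bytes) != cudaSuccess ||
        cudaMalloc((void**)&dB, bytes) != cudaSuccess ||
        cudaMalloc((void**)&dC, bytes) != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(cudaGetLastError()));
        return 1;
    }
    cudaMemset(dA, 0, bytes);                        // zero-filled inputs (assumption)
    cudaMemset(dB, 0, bytes);

    dim3 block(16, 16);                              // assumed block size
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);

    int status = 0;
    for (long i = 0; i < maxIters; ++i) {
        matMulNaive<<<grid, block>>>(dA, dB, dC, n);

        // Launch-time errors (bad configuration) surface here...
        cudaError_t launchErr = cudaGetLastError();
        // ...while execution errors surface at the synchronization point.
        cudaError_t syncErr = cudaThreadSynchronize();

        if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
            printf("iteration %ld: launch=\"%s\" sync=\"%s\"\n", i,
                   cudaGetErrorString(launchErr), cudaGetErrorString(syncErr));
            status = 1;
            break;
        }
    }

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return status;
}
```

Checking `cudaGetLastError()` separately from the synchronize call helps tell a rejected launch apart from a kernel that started and then faulted. On newer toolkits, `cudaThreadSynchronize()` is deprecated in favor of `cudaDeviceSynchronize()`.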
Does anybody know what is causing this? It seems very similar to the error that theMatrix got in http://forums.nvidia.com/lofiversion/index.php?t74853.html, except that I get an “unspecified launch failure” rather than a “launch timeout”. Any help would be greatly appreciated.