Simple kernel crashes Nvidia driver (CUDA bug report)

This simple little kernel crashes the Nvidia 197.45 driver. Here are my system specs:
Windows 7 64
MS Visual Studio 2005
Intel i975, 12 GB memory, GTX 295 video

I attached the .cu file that crashes the driver. I also print it in this message for convenience. Note:

  1. This kernel appears to do nothing useful because I distilled it down from a huge kernel, isolating the minimum code necessary to reproduce the problem.
  2. Trivial changes to this code let the kernel work fine. The crash scenario appears to be very sensitive.
  3. The outer loop must execute a huge number of times. It runs thousands of passes before the driver crashes.
  4. The body of the ‘if’ statement inside the loop SHOULD never execute, but my experiments indicate that it does seem to run! Is the compiler generating bad machine code?
  5. After this kernel runs for a few seconds, my screen goes black for a few seconds. It comes back with a message saying that the driver has recovered from a serious error. The program ends with a cudaError return code of 30 immediately after the kernel launch.
  6. This bug must be rare, because I have thousands of lines of complex CUDA code that run perfectly. This is the first problem I’ve run into.
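
As a side note on that return code: the number can be decoded in the program itself with cudaGetErrorString(), which is part of the CUDA runtime API (in the CUDA releases of that era, code 30 was, I believe, cudaErrorUnknown). A minimal fragment, not part of the attached file:

```cuda
// Sketch: translate the numeric cudaError_t into readable text
// instead of reporting the raw integer.
cudaError_t err = cudaGetLastError () ;
if (err != cudaSuccess)
   printf ( "Kernel error %d: %s\n", (int) err, cudaGetErrorString ( err ) ) ;
```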

#include <stdio.h>
#include <conio.h>
#include <cuda_runtime.h>

__global__ void cuda_pass ( float *rets )
{
   int ithread, icase, iop, op, opcodes[5] ;

   ithread = blockIdx.x * blockDim.x + threadIdx.x ;

   if (ithread != 0)      // This code executes for only thread 0
      return ;            // So we don't have to worry about inter-thread issues

   opcodes[0] = opcodes[1] = opcodes[2] = opcodes[3] = opcodes[4] = 0 ;

   for (icase=0 ; icase<5000000 ; icase++) {   // Loop runs thousands of times before driver crashes
      for (iop=0 ; iop<5 ; iop++) {
         op = opcodes[iop] ;
         if (op != 0) {             // The body of this if clause should never be executed
            rets[ithread] = op ;    // But it appears to be!  ithread will be zero only.
            break ;
            }
         } // For all opcodes
      } // For all cases
}

int main ( int argc, char** argv )
{
   float *CUDA_returns ;
   cudaError_t CudaError ;

   CudaError = cudaGetLastError () ;   // Reset CUDA error flag
   if (CudaError != cudaSuccess) {
      printf ( "Startup error %d", (int) CudaError ) ;
      _getch () ;
      exit ( 1 ) ;
      }

   CudaError = cudaMalloc ( (void **) &CUDA_returns , (size_t) (32 * sizeof(float)) ) ;
   if (CudaError != cudaSuccess) {
      printf ( "Malloc error %d", (int) CudaError ) ;
      _getch () ;
      exit ( 1 ) ;
      }

   cuda_pass <<< 1, 32 >>> ( CUDA_returns ) ;   // This also fails for 1 thread
   cudaThreadSynchronize () ;

   CudaError = cudaGetLastError () ;
   if (CudaError != cudaSuccess) {
      printf ( "Kernel error %d", (int) CudaError ) ;
      _getch () ;
      exit ( 1 ) ;
      }

   cudaFree ( CUDA_returns ) ;
   cudaThreadExit () ;

   printf ( "Success" ) ;
   _getch () ;
   return 0 ;
}

Err don’t you have some sort of integer overflow there? Your counter icase is an integer, so you will get rollover after icase=2^31. It certainly won’t make it to 5 000 000, and what it does when the sign bit gets incremented might be a bit unpredictable. What happens if you make icase a long instead?

Avid, I’d believe that if there were three more zeros. With the code as it is, this should be fine…

Oh dear, the perils of sleep deprivation. Apologies for adding to the noise…

Point 5 suggests that the problem is Windows resetting the display driver because the kernel keeps the GPU busy for more than a few seconds (I think the limit is around 5 seconds), during which time the GPU doesn’t respond to Windows graphics commands, so Windows concludes the GPU driver has hung and resets it. I don’t know how to solve this one, though.

Your suggestion sounds like it may play a major role here, in which case I definitely need to find a way to increase that time limit. I’ll soon have apps that run for more than 5 seconds at a shot.
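
In case it helps anyone searching later: on Vista and Windows 7 this watchdog is the WDDM Timeout Detection and Recovery (TDR) mechanism, and Microsoft documents registry values under the GraphicsDrivers key that control it. A sketch of a .reg fragment (the 10-second value is my own arbitrary choice, and changes take effect only after a reboot):

```
Windows Registry Editor Version 5.00

; TDR control values documented by Microsoft for WDDM drivers.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
; Seconds the GPU may stay unresponsive before Windows resets the
; driver (the default is 2).  0x0000000a = 10 seconds.
"TdrDelay"=dword:0000000a
```

Setting TdrLevel to 0 under the same key reportedly disables detection entirely, but that seems risky on a GPU that also drives the display.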

However, the body of the IF statement should never be executed. Yet if I change the rets[ithread]=op statement to something that does not reference global memory, it runs perfectly. I disabled all optimization in case the compiler was optimizing the loop in some way. The contents of that IF block play a role, even though they should never be executed. If I could disable the 5-second time-out I could learn a lot more.


If you change the rets[ithread]=op statement to something that does not reference global memory, then the if statement and loop are probably being eliminated by the compiler as dead code - with no observable output, it is free to remove them. I’m not sure if you can disable this optimization - never tried it.

As a side note, how do you know that the code in the if statement is running?

I think that, even when the GPU never executes the contents of that if statement, the branching instructions take up so much time that running them 5 000 000 times takes too long.

I don’t know for sure that the IF statement block is executing. All I know is that the contents of this block impact whether it crashes or not.

From the helpful comments I’ve gotten here, I am coming to the conclusion that the following is happening:

Despite the fact that I have disabled all optimization, the compiler is still optimizing the loop into oblivion when I refrain from changing global memory. When I do reference global memory inside the loop, the loop itself, without execution of its IF block, runs so slowly that it causes a timeout.
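
If that is what’s happening, one common workaround (my sketch of the general pattern; cuda_pass_chunk, start, and chunk are hypothetical names, not from this thread) is to split the long loop across many short kernel launches so that no single launch exceeds the watchdog window:

```cuda
// Hypothetical host-side sketch: run the 5,000,000 cases in short slices.
// Each launch processes 'chunk' cases beginning at 'start', so no single
// launch holds the GPU long enough to trip the Windows watchdog.
const int total_cases = 5000000 ;
const int chunk = 100000 ;                // assumed slice size; tune as needed
for (int start = 0 ; start < total_cases ; start += chunk) {
   cuda_pass_chunk <<< 1, 32 >>> ( CUDA_returns, start, chunk ) ;  // hypothetical kernel
   cudaThreadSynchronize () ;             // let the display driver respond between slices
   }
```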

I do wish that somewhere in the extensive documentation for CUDA, someone had mentioned that a kernel has a limited processing time before Windows trashes it. I’ve spent about 15 hours trying to track down what I assumed was a bug in my code, only to learn that it is really an undocumented ‘feature’ of the OS and/or driver. Sigh. Anyhow, thank you all for your help. I am now going to start a new thread about timeouts.