Getting around apparent CUDA bugs

Hi all,

So I rewrote a grid-solver physics program to run on GPUs, but my progress has been halted by what looks an awful lot like two separate CUDA bugs. I’ll explain them separately below:

  1. When I run a standard test case on the original CPU version, it takes 87 timesteps to reach the finish (it calculates a new delta t every timestep based on the stored physical values). Running the same code on the GPU also takes 87 timesteps, but by the end of the simulation there is about a 0.01% random error in the timesteps (and physical data). The error is nonreproducible (a different deviation from the CPU version’s consistent result every time I run the program), which makes me think it’s a hardware issue (along with seeing other people on these forums report the same problem). Originally the error was larger (timesteps between 85 and 90 or so), but buffering all my constant and device vectors knocked it down to this level. I’m still working on buffering my cudaMalloc3D’d vectors, but if that doesn’t fix it completely I don’t know what else to try. My program launches about 87*3 kernels over the life of this simulation; do many CUDA programmers duck this bug entirely by not launching kernels over and over? (I launch kernels inside a loop from the CPU to allow for data output etc.)
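For context, the launch pattern I mean is just a host-side loop, sketched below with hypothetical names (step_a/step_b/step_c and compute_dt stand in for my three solver kernels and the delta-t calculation; write_output is my hypothetical I/O hook):

```cuda
// Sketch of the host-side timestep loop. Launching kernels from a CPU
// loop like this is ordinary practice; each launch is queued
// asynchronously, so the loop itself should not introduce error.
for (int step = 0; step < nsteps; ++step) {
    step_a<<<blocks, threads>>>(d_grid, dt);
    step_b<<<blocks, threads>>>(d_grid, dt);
    step_c<<<blocks, threads>>>(d_grid, dt);
    compute_dt<<<blocks, threads>>>(d_grid, d_dt);  // new delta t from stored values
    cudaMemcpy(&dt, d_dt, sizeof(float), cudaMemcpyDeviceToHost);
    if (step % output_interval == 0)
        write_output(d_grid);                        // hypothetical data output
}
```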

  2. When I run the program for the first time after compiling, I get this error when the first kernel of the program launches:

Error: Kernel Failure! (my cudaGetLastError() output)
CUDA ERROR: cudaMemcpy - main.c - var : 4 : unspecified launch failure

So the kernel appears to be failing, but the second time I run the program it launches fine (and every time after that). When I double the number of grid elements (doubling the number of blocks), the kernel launch fails for the first two runs or so, with the number of initial failures doubling as the number of blocks in the kernel call doubles. The kernel definitely ends up running eventually, though, which makes me think this is another hardware problem. It’s a pain to have to sit through this error message twenty times to start a run.
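For what it’s worth, an “unspecified launch failure” reported by a later cudaMemcpy is usually a deferred error from an earlier asynchronous kernel launch. A minimal checking pattern that forces the error to surface at the launch site might look like this (the CHECK macro name and my_kernel are my own, not from the original code):

```cuda
// Sketch: synchronize after each launch while debugging, so the error is
// reported at the kernel that caused it rather than at the next cudaMemcpy.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err));                      \
            exit(1);                                               \
        }                                                          \
    } while (0)

my_kernel<<<blocks, threads>>>(args);   // hypothetical kernel
CHECK(cudaGetLastError());              // catches launch-configuration errors
CHECK(cudaDeviceSynchronize());         // catches errors during execution
```

The cudaDeviceSynchronize() calls cost performance, so this is a debugging pattern, not something to leave in production runs.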

Have any of you ever encountered similar behavior, and if so do you have a story of how you got around it?

(I’m running CUDA 4.0 on a GTX 460, compiled with -arch=sm_21, although error 1 also appears (with greater random inaccuracy) on CUDA 3.2 and a Tesla M2070. I did not test for the second error with that card and no longer have access to it.)



Both of the observed symptoms (sporadic launch failures, varying results) are consistent with the use of uninitialized data or out-of-bounds accesses in the GPU code. Have you had a chance to check the GPU code with cuda-memcheck?

Your guess was correct; cuda-memcheck is detecting out-of-bounds memory accesses. Thanks so much for alerting me to this tool, and I feel awfully silly now for trying to pin my problems on CUDA.

I also didn’t appreciate cuda-memcheck until I blew an entire afternoon swapping GPUs in and out (and between slots) of my development workstation while hunting for a bizarre performance bug. After constructing ever more elaborate models of a crazy CUDA driver/hardware bug, I finally tried out cuda-memcheck and found the off-by-one error in my array indexing 5 seconds later. :)

Hi again,

So I worked through the errors cuda-memcheck was giving me (the program I’m porting does some weird stuff with arrays), and while this has improved the program’s behavior (the kernel never fails to launch, and the random errors seem to have decreased slightly for larger grid sizes), some nondeterministic error remains (though cuda-memcheck now reports no errors at all). I’m at my wit’s end at the moment; if anyone can tell from these symptoms what might be going on, or how to proceed with debugging, I would be very appreciative.

So for larger grid sizes I let the program run to 1000 timesteps. With the larger problem, the timestep will occasionally jump to infinity and the simulation ends prematurely. Here are some values of N (cycles) at which this occurs: 390, 518, 646, 134, 902. These numbers are all separated by multiples of 128, the number of threads per block in the kernel call, which is suspicious to me since that number appears nowhere else in the program. Also, if I change the 128 in the code the behavior gets much worse (for example, 97 threads per block causes the program to always take an infinite timestep at the 50th iteration). Could I have a memory error, undetected by cuda-memcheck, that cycles by 1 every timestep and occasionally causes this blow-up?
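One classic source of errors tied to the threads-per-block count is the last, partially-filled block running past the end of the arrays when the grid size is not a multiple of the block size. The usual guard, sketched with hypothetical names, looks like:

```cuda
// Sketch: when N is not a multiple of blockDim.x, the block count is
// rounded up, and the excess threads in the last block must be masked off.
__global__ void update(float *data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;   // without this, the last block runs off the end
    data[idx] = 0.0f;       // placeholder for the real update
}

// Host side: round the block count up so every element is covered.
int threads = 128;
int blocks  = (N + threads - 1) / threads;
update<<<blocks, threads>>>(d_data, N);
```

If a guard like this is missing (or uses the wrong N), changing 128 to 97 would change which elements get clobbered, which roughly matches the behavior described.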

Also, since this bothers me, here is a description of the weird pointer thing (the source of the cuda-memcheck errors, now fixed) which I worry may be running amok in GPU memory. In the CPU program, an array of Struct structs (five floats, nothing else) is used to build an array of pointers Float** pS[DimX][structelements], and this is looped over structelements to get each float structure element (for speedup). I preserved this, but now use a shared Struct elem_shared[DimX] and access the elements as ((float*)&(elem_shared[i]))[n]. Will shared memory always lay out structs contiguously to allow this access pattern? Also, each thread has to use the elements before and after its own; bank conflicts can’t yield bad data values, can they?

Also, are there any other debugging programs which might help me? cuda-memcheck was already a lifesaver.

Thanks again



Try cuda-gdb: memcheck can be turned on inside the CUDA debugger (set cuda memcheck on), so it will break at the faulting access.