Getting around apparent CUDA bugs

Hi all,

So I rewrote a grid-solver physics program to run on GPUs, but my progress has been halted for the moment by what looks an awful lot like two separate CUDA bugs. I’ll explain them separately below:

  1. When I run a standard test case on the original CPU version, it takes 87 timesteps to reach the finish (a new delta t is calculated every timestep from the stored physical values). Running the same code on the GPU also takes 87 timesteps, but by the end of the simulation there is roughly a 0.01% random error in the timesteps (and in the physical data). The error is not reproducible: the CPU version gives the same result every run, while the GPU version deviates from it by a different amount each time, which makes me think it’s a hardware issue (that, and seeing other people on these forums report the same problem). Originally the error was larger (between about 85 and 90 timesteps), but buffering all my constant and device vectors knocked it down to this level. I’m still working on buffering my cudaMalloc3D’d vectors, but if that doesn’t fix it completely I don’t know what else to try. The program launches about 87*3 kernels over the life of the simulation; do many CUDA programmers duck this bug entirely by not launching kernels over and over? (I launch kernels inside a loop from the CPU to allow for data output etc.; see the sketch after point 2 for roughly what I mean.)

  2. The first time I run the program after compiling, I get this error as soon as the program’s first kernel launches:

Error: Kernel Failure! (my cudaGetLastError() output)
CUDA ERROR: cudaMemcpy - main.c - var : 4 : unspecified launch failure

So the kernel appears to be failing, but the second time I run the program it launches fine (and every time after that). When I double the number of grid elements (doubling the number of blocks), the kernel launch fails for roughly the first two runs, with the number of initial failures doubling as the number of blocks in the kernel call doubles. The kernel definitely ends up running eventually, which makes me think this is another hardware problem. It can be a pain, though, to have to go through this error message twenty times just to get a run going.
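
In case the structure matters, here is roughly what I mean by launching kernels in a loop from the CPU, together with the kind of launch check that prints the “Kernel Failure!” message above. This is a heavily stripped-down stand-in (the kernel body, names and sizes are all made up), not my actual code:

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Stand-in for one solver step: each thread updates its own cell, and one
       thread writes back a (dummy) timestep estimate for the host to read. */
    __global__ void stepKernel(float *state, float *dtOut, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            state[i] += 1.0f;        /* placeholder for the real update       */
        if (i == 0)
            *dtOut = 0.001f;         /* placeholder for the real CFL estimate */
    }

    int main(void)
    {
        const int n = 1 << 20, threads = 128, blocks = (n + threads - 1) / threads;
        float *d_state, *d_dt, dt, t = 0.0f;
        cudaMalloc((void **)&d_state, n * sizeof(float));
        cudaMalloc((void **)&d_dt, sizeof(float));
        cudaMemset(d_state, 0, n * sizeof(float));

        for (int step = 0; step < 87; ++step) {
            stepKernel<<<blocks, threads>>>(d_state, d_dt, n);

            /* Check the launch itself, then check execution by synchronizing. */
            cudaError_t err = cudaGetLastError();
            if (err != cudaSuccess)
                fprintf(stderr, "Error: Kernel Failure! %s\n", cudaGetErrorString(err));
            err = cudaDeviceSynchronize();
            if (err != cudaSuccess)
                fprintf(stderr, "CUDA ERROR: %s\n", cudaGetErrorString(err));

            /* The new dt comes back to the host each step, so the host drives the loop. */
            cudaMemcpy(&dt, d_dt, sizeof(float), cudaMemcpyDeviceToHost);
            t += dt;
        }

        printf("finished at t = %f after 87 steps\n", t);
        cudaFree(d_state);
        cudaFree(d_dt);
        return 0;
    }

The real program does essentially this about 87*3 times per run, writing data out between some of the launches.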

Have any of you ever encountered similar behavior, and if so do you have a story of how you got around it?

(I’m running CUDA 4.0 on a GTX 460 with -arch=sm_21, although error 1 also appears (with greater random inaccuracy) on CUDA 3.2 and a Tesla M2070. I did not test for the second error on that card and no longer have access to it.)

Thanks!

S

Both of the observed symptoms (sporadic launch failure, varying results) are consistent with the use of uninitialized data or out-of-bounds accesses in the GPU code. Have you had a chance to check the GPU code with cuda-memcheck (see: http://developer.nvidia.com/cuda-memcheck)?
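
The standalone tool just wraps a normal run of the application, for example (the binary name here is a placeholder):

    cuda-memcheck ./solver

It reports out-of-bounds and misaligned accesses in the kernels as they occur, along with the offending kernel, block, and thread.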

Your guess was correct; cuda-memcheck is detecting out-of-bounds memory accesses. Thanks so much for alerting me to this tool, and I feel awfully silly now for trying to pin my problems on CUDA.

I also didn’t appreciate cuda-memcheck until I blew an entire afternoon swapping GPUs in and out (and between slots) of my development workstation while hunting for a bizarre performance bug. After constructing ever more elaborate models of a crazy CUDA driver/hardware bug, I finally tried out cuda-memcheck and found the off-by-one error in my array indexing 5 seconds later. :)

Hi again,

So I worked through the errors cuda-memcheck was giving me (the program I’m porting does some weird stuff with arrays), and while this has improved the program’s behavior (the kernel never fails to launch now, and the random errors seem to have decreased slightly for larger grid sizes), some nondeterministic error remains, even though cuda-memcheck no longer reports anything. I’m at my wit’s end at the moment; if anyone can tell from these symptoms what might be going on, or how to proceed with the bug hunt, I would be very appreciative.

For larger grid sizes I let the program run to 1000 timesteps. With the larger problem, the timestep will occasionally jump to infinity and the simulation ends prematurely. Here are some values of N (cycles) at which this has happened: 390, 518, 646, 134, 902. These numbers are all separated by multiples of 128, the number of threads per block in the kernel call, which is suspicious to me since that number does not appear anywhere else in the program. Also, if I change the 128 in the code the behavior gets much worse (for example, 97 threads per block causes the program to always take an infinite timestep at the 50th iteration). Could I have a memory error that cuda-memcheck does not detect, one that shifts by 1 every cycle and occasionally causes this blow-up?

Also, since it bothers me, here is a description of the weird pointer trick (the source of the cuda-memcheck errors, now fixed) which I worry may still be running amok in GPU memory. In the CPU program, an array of Struct structs (five floats, nothing else) is used to build an array of pointers, float* pS[DimX][structelements], which is then looped over structelements to get at each float element of the structs (for speed). I preserved this, but now I use a shared array Struct elem_shared[DimX] and access the elements as ((float*)&(elem_shared[i]))[n]; a boiled-down version of the pattern is sketched below. Will the structs in shared memory always be contiguous, so that this access pattern is safe? Also, each thread has to use the elements before and after its own; bank conflicts can’t yield bad data values, can they?
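
Boiled down, the pattern looks something like this (the names, sizes and arithmetic here are made up for illustration; the real kernels are messier):

    struct Struct { float a, b, c, d, e; };    /* five floats, nothing else */

    __global__ void stencil(const Struct *in, float *out, int n)
    {
        __shared__ Struct elem_shared[128];    /* 128 = threads per block */

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            elem_shared[threadIdx.x] = in[i];  /* each thread loads its own struct */

        __syncthreads();  /* all writes must land before any neighbour is read */

        /* Interior threads loop over the five elements of their own struct and of
           both neighbours, treating each shared struct as five contiguous floats. */
        if (threadIdx.x > 0 && threadIdx.x < blockDim.x - 1 && i + 1 < n) {
            float sum = 0.0f;
            for (int e = 0; e < 5; ++e) {
                float left   = ((float *)&(elem_shared[threadIdx.x - 1]))[e];
                float center = ((float *)&(elem_shared[threadIdx.x    ]))[e];
                float right  = ((float *)&(elem_shared[threadIdx.x + 1]))[e];
                sum += 0.25f * left + 0.5f * center + 0.25f * right;
            }
            out[i] = sum;
        }
    }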

Also, are there any other debugging programs that might help me? cuda-memcheck was already a lifesaver.

Thanks again

S

Hello,

Try cuda-gdb; the memcheck functionality can be turned on inside the CUDA debugger.
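
If I remember correctly it is enabled from the debugger prompt before running the program, something like:

    (cuda-gdb) set cuda memcheck on
    (cuda-gdb) run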

Cristian