How to debug: CUDA kernel fails when there are many threads?

I am using a Quadro K2000M card (CUDA compute capability 3.0) with CUDA driver 5.5 and runtime 5.0, programming in Visual Studio 2010. My GPU algorithm runs many parallel breadth-first searches (BFS) of a tree (which is constant). The threads are independent except for reading from a constant array and the tree. Each thread may perform some malloc/free operations, following the queue-based BFS algorithm (no recursion). There are N threads; the number of tree leaf nodes is also N. I use 256 threads per block and (N+256-1)/256 blocks per grid.
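For reference, a minimal sketch of the launch configuration described above (the kernel name and arguments are placeholders, not my actual code):

```cuda
// Hypothetical kernel: each thread runs one BFS starting from one leaf node.
__global__ void bfsKernel(const int *tree, const int *constArray, int N)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= N) return;  // guard: the last block may be only partially full
    // ... queue-based BFS using in-kernel malloc/free ...
}

void launchBfs(const int *d_tree, const int *d_const, int N)
{
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    bfsKernel<<<blocksPerGrid, threadsPerBlock>>>(d_tree, d_const, N);
}
```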

Now the problem: the program works for up to about N=100,000 threads but fails for more than that. It also works on the CPU, or on the GPU running one thread at a time. When N is large (e.g. >100,000), the kernel crashes and the subsequent cudaMemcpy from device to host also fails. I tried Nsight, but it is too slow.
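One way to capture the actual failure code, rather than only seeing the later cudaMemcpy fail, is to check for errors immediately after the launch. A minimal sketch (insert right after the kernel call):

```cuda
#include <cstdio>

// kernel<<<grid, block>>>(...);            // the launch in question

// Launch/configuration errors (bad grid size, etc.) are reported immediately:
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));

// Errors raised while the kernel ran (illegal address, timeout, ...) only
// surface once the host synchronizes with the device:
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("kernel error: %s\n", cudaGetErrorString(err));
```

The string returned by cudaGetErrorString (e.g. "the launch timed out and was terminated" vs. "an illegal memory access was encountered") usually narrows the problem down considerably.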

I have now set cudaDeviceSetLimit(cudaLimitMallocHeapSize, 268435456); (256 MB). I also tried larger values, up to 1 GB; cudaDeviceSetLimit succeeds, but the problem remains.
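Even with a larger heap, in-kernel malloc can still fail when many threads allocate at once; it returns NULL rather than aborting, so dereferencing an unchecked pointer will crash the whole kernel. A hedged sketch of checking every device-side allocation (failFlag is a hypothetical device int, zero-initialized by the host):

```cuda
__global__ void allocCheckDemo(int *failFlag)
{
    // Each thread attempts a small device-side allocation.
    int *queue = (int *)malloc(64 * sizeof(int));
    if (queue == NULL) {
        // Heap exhausted: record the failure for the host to inspect
        // instead of dereferencing NULL.
        atomicExch(failFlag, 1);
        return;
    }
    // ... use the queue ...
    free(queue);
}
```

If failFlag comes back nonzero on the host, the heap limit is the problem regardless of what cudaDeviceSetLimit reported.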

Does anyone know a common reason for the above problem, or any hints for further debugging? I tried adding some printf's, but they produce tons of output. Moreover, once a thread crashes, all remaining printf output is discarded, so it is hard to identify the problem.

It sounds like the CUDA kernel is running past the TDR (Timeout Detection and Recovery) delay limit. When a kernel runs longer than this limit, the Windows display driver terminates it. To increase the time limit, follow the instructions here: http://http.developer.nvidia.com/NsightVisualStudio/3.0/Documentation/UserGuide/HTML/Content/Timeout_Detection_Recovery.htm
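For reference, the change boils down to the TdrDelay registry value under the GraphicsDrivers key, as described in the linked document. A sketch of a .reg file raising the timeout from the default 2 seconds to 60 (edit the registry at your own risk; a reboot is required for it to take effect):

```
Windows Registry Editor Version 5.00

; Raise the GPU watchdog timeout to 60 seconds (0x3c).
; The default on Windows Vista and later is 2 seconds.
[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000003c
```

Alternatively, running the kernel on a GPU that is not driving a display avoids the watchdog entirely.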

What do you mean by “I tried Nsight, but it is too slow”?