Works in Nsight runs but fails in normal runs. Why?

I have a program that works fine in Nsight CUDA debug runs but fails in command-line runs and in Visual Studio 2010 debug runs. Its CPU version works. Are there any common reasons for this?

Or are there any hints for debugging? Since the Nsight runs show no errors, breakpoint inspection in Nsight is useless. I also put some printf calls in the kernels, but it seems that once a kernel fails, all remaining printf output (even from other kernels) is discarded. So it is very hard to identify which kernel is at fault.
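One way to narrow down which kernel fails, without relying on in-kernel printf, is to check the runtime error state after every launch. The following is only a minimal sketch: CUDA_CHECK, kernelA, kernelB and runKernels are hypothetical names standing in for the real program, and the buffer size is arbitrary.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): report file/line and exit
// if a CUDA runtime call did not return cudaSuccess.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Placeholder kernels standing in for the real ones.
__global__ void kernelA(int *data) { data[threadIdx.x] = threadIdx.x; }
__global__ void kernelB(int *data) { data[threadIdx.x] *= 2; }

// Runs both kernels and checks for errors after each one.
// Returns 0 if every step succeeded (CUDA_CHECK exits otherwise).
int runKernels() {
    int *d = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d, 32 * sizeof(int)));

    kernelA<<<1, 32>>>(d);
    CUDA_CHECK(cudaGetLastError());       // errors detected at launch
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while kernelA ran

    kernelB<<<1, 32>>>(d);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while kernelB ran

    CUDA_CHECK(cudaFree(d));
    return 0;
}

int main() {
    return runKernels();
}
```

Because kernel launches are asynchronous, the cudaDeviceSynchronize call after each launch is what ties a reported error to a specific kernel. It slows the program down, so it is probably best kept to debug builds.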

Solved. The problem was cudaLimitMallocHeapSize: the default value is too small.
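For anyone hitting the same thing, here is a rough sketch of raising the limit. It assumes the kernels allocate from the device heap with in-kernel malloc or new. The 64 MB figure, allocKernel, setDeviceHeap and the launch configuration are made up for illustration, so pick a size that matches your own allocations.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: every thread allocates from the device heap.
// Device-side malloc returns NULL when the heap is exhausted.
__global__ void allocKernel(size_t bytesPerThread, int *failures) {
    char *p = (char *)malloc(bytesPerThread);
    if (p == NULL) {
        atomicAdd(failures, 1);  // count failed allocations
        return;
    }
    p[0] = 1;
    free(p);
}

// Hypothetical helper: request a device heap of heapBytes and return
// the limit the runtime reports afterwards (0 on error).
size_t setDeviceHeap(size_t heapBytes) {
    // Set this before launching any kernel that uses device malloc.
    if (cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes) != cudaSuccess)
        return 0;
    size_t actual = 0;
    if (cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize) != cudaSuccess)
        return 0;
    return actual;
}

int main() {
    size_t heap = setDeviceHeap((size_t)64 * 1024 * 1024);  // assumed size
    printf("device heap limit: %llu bytes\n", (unsigned long long)heap);

    int *dFailures = NULL;
    cudaMalloc((void **)&dFailures, sizeof(int));
    cudaMemset(dFailures, 0, sizeof(int));

    allocKernel<<<4, 256>>>(16 * 1024, dFailures);  // 16 KB per thread
    cudaDeviceSynchronize();

    int failures = 0;
    cudaMemcpy(&failures, dFailures, sizeof(int), cudaMemcpyDeviceToHost);
    printf("failed device allocations: %d\n", failures);

    cudaFree(dFailures);
    return 0;
}
```

Checking the pointer returned by malloc inside the kernel, as above, turns a silent crash into a count you can read back on the host.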