It’s entirely plausible that an error does not become evident until the code is run under cuda-memcheck or the memory checker built into cuda-gdb. That is the reason cuda-memcheck was created, and in that respect it is similar to tools like valgrind, which find “hidden” or “latent” errors.
An example of such an error would be reading one element beyond the end of allocated memory. Ordinary host code or device code will often not report an error in such a situation, even though it is illegal behavior. However, host code run under valgrind, or CUDA code run under cuda-memcheck, will identify such a situation. I’m not saying this is exactly your situation, just giving an example of how it could plausibly arise.
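To make the one-past-the-end case concrete, here is a minimal sketch of a kernel with that kind of latent bug. The kernel name `shift_left`, the helper `neighbor`, and the sizes are made up for illustration; they are not taken from your code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: the index each thread reads from.
__host__ __device__ inline int neighbor(int i) { return i + 1; }

// Hypothetical kernel: copies in[i + 1] into out[i].
// The thread with i == n - 1 reads in[n], one element past the allocation.
__global__ void shift_left(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[neighbor(i)];   // out-of-bounds read when i == n - 1
}

int main()
{
    const int n = 256;
    int *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));
    cudaMemset(in, 0, n * sizeof(int));

    shift_left<<<1, n>>>(in, out, n);
    cudaError_t err = cudaDeviceSynchronize();

    // A plain run may well print "no error" here, since the stray read
    // can land in memory the process is still allowed to touch.
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(in);
    cudaFree(out);
    return 0;
}

// Build with line information and run under the checker:
//   nvcc -lineinfo -o oob oob.cu
//   cuda-memcheck ./oob
```

Whether the plain run fails or not depends on the allocator and the GPU, which is exactly why this kind of bug can stay hidden. On newer toolkits, where cuda-memcheck has been replaced, the equivalent tool is compute-sanitizer.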
With respect to how to debug such an issue, I would start by running your code in an ordinary fashion (not in cuda-gdb) but with cuda-memcheck.
Follow the instructions here to localize the illegal memory access to a specific line of kernel source code:
https://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218#27278218
Such errors often come about due to erroneous indexing. Once you’ve identified the specific line of source code that is causing the error, you may immediately spot the issue or may be able to use printf statements in kernel code to identify what is happening.
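As a sketch of the printf approach, assuming the offending line turned out to be an indexed load, you could guard it and print the index from the threads that go wrong. The kernel `gather`, the helper `in_range`, and the data are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true when idx is a valid index into an n-element array.
__host__ __device__ inline bool in_range(int idx, int n)
{
    return idx >= 0 && idx < n;
}

// Hypothetical kernel: dst[i] = src[indices[i]].
__global__ void gather(const int *src, const int *indices, int *dst,
                       int n_src, int n_dst)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_dst) return;

    int idx = indices[i];
    if (!in_range(idx, n_src)) {
        // Report only the offending threads, then skip the bad access.
        printf("block %d thread %d: idx=%d is outside [0,%d)\n",
               blockIdx.x, threadIdx.x, idx, n_src);
        return;
    }
    dst[i] = src[idx];
}

int main()
{
    const int n = 8;
    // The last entry is deliberately out of range for an 8-element source.
    int h_idx[n] = {0, 1, 2, 3, 4, 5, 6, 8};

    int *src = nullptr, *idx = nullptr, *dst = nullptr;
    cudaMalloc(&src, n * sizeof(int));
    cudaMalloc(&idx, n * sizeof(int));
    cudaMalloc(&dst, n * sizeof(int));
    cudaMemset(src, 0, n * sizeof(int));
    cudaMemcpy(idx, h_idx, sizeof(h_idx), cudaMemcpyHostToDevice);

    gather<<<1, n>>>(src, idx, dst, n, n);
    cudaDeviceSynchronize();   // device-side printf output is flushed here

    cudaFree(src);
    cudaFree(idx);
    cudaFree(dst);
    return 0;
}
```

Printing only from the threads that fail the check keeps the output manageable; an unconditional printf in a large grid produces far more output than you can read and may overflow the device printf buffer.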
Barring that, you can use that line of source code to focus your effort with cuda-gdb. Set a breakpoint immediately before that line of source code, and inspect variables, indices, etc. If need be, work backward, in typical debugging fashion. At this point the debugging concepts are no different from host code debugging: set breakpoints, inspect variables, single-step, etc.
If you can ascertain from the cuda-memcheck experiment the actual condition (e.g. index out of range) that is causing the illegal memory access, you could use that information to further focus your effort in cuda-gdb by setting a conditional breakpoint, based on the index value, for example. The breakpoint will then be hit only by the thread/warp that is actually about to make the illegal access. Be advised that conditional breakpoints can have a large effect on the speed of debugging (the speed of code execution in debug mode under cuda-gdb).
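A conditional breakpoint of that kind might look like the sketch below. The kernel `scale`, the helper `element_index`, the file name `scale.cu`, and the line number in the breakpoint are all placeholders; substitute the line that cuda-memcheck reported for your own code. The kernel is deliberately buggy so that the condition has something to catch.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: the element a given global thread touches.
__host__ __device__ inline int element_index(int global_thread, int stride)
{
    return global_thread * stride;
}

// Hypothetical kernel with a deliberate bug: idx is never checked against n.
__global__ void scale(float *data, int n, int stride)
{
    int idx = element_index(blockIdx.x * blockDim.x + threadIdx.x, stride);
    data[idx] *= 2.0f;   // suspect line: idx reaches n and beyond when stride > 1
}

int main()
{
    const int n = 64;
    float *data = nullptr;
    cudaMalloc(&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    scale<<<1, n>>>(data, n, 2);   // threads 32..63 compute idx >= n
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}

// Build with device debug information:
//   nvcc -g -G -o scale scale.cu
//
// Then in cuda-gdb (NN stands for the line number of the suspect line):
//   (cuda-gdb) break scale.cu:NN if idx >= n
//   (cuda-gdb) run
//   (cuda-gdb) print idx
//   (cuda-gdb) info cuda threads
```

When the breakpoint is hit, the debugger focus is on a thread for which the condition holds, so `print idx` shows an offending value directly and you can work backward from there.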