Debugging illegal accesses

Hello,

I have been trying to debug some illegal access errors in my application lately, and it seems I am spending a lot of energy on something that should be simpler.

Is there any easy way to get a backtrace when the error happens? I have been trying cuda-gdb, but it simply does not break when the error occurs. Regular gdb tricks don’t seem to work. Am I missing something?

Is it possible to get a backtrace during device sync instead?

If you’re not familiar with using cuda-memcheck, you may want to start there:

https://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218#27278218

cuda-gdb usually breaks on an error for me, but I don’t generally try to debug on a display GPU. Make sure you are compiling the code with the debug switches -g -G (host and device debug info, respectively).

You may also need to set cuda memcheck on:

https://docs.nvidia.com/cuda/cuda-gdb/index.html#set-cuda-memcheck

To pinpoint the faulting instruction most precisely, you may also need to use autostep:

https://docs.nvidia.com/cuda/cuda-gdb/index.html#example-autostep

You can also enable coredumps:

https://devtalk.nvidia.com/default/topic/1045549/cuda-programming-and-performance/how-to-use-cuda-gdb-core-dump/

Regular gdb won’t work, of course. It knows nothing about device code or GPU errors.
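To check that the debugger setup itself is working, a tiny reproducer can help. The following is only a sketch: the kernel, the helper `runFaultyKernel`, and the file name `repro.cu` are made up for illustration, and the null pointer is just a convenient way to force an illegal access. If cuda-gdb stops on this program but not on the real application, the problem is more likely specific to the application than to the toolchain.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Deliberately faulty kernel: a single thread stores through the pointer it is given.
__global__ void writeThrough(int *p) {
    *p = 42;
}

// Hypothetical helper: launch the kernel with a null device pointer and
// return whatever the following synchronization reports.
cudaError_t runFaultyKernel() {
    writeThrough<<<1, 1>>>(nullptr);
    return cudaDeviceSynchronize();
}

int main() {
    cudaError_t err = runFaultyKernel();
    std::printf("sync returned: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}

// Build with host and device debug info (file name is illustrative):
//   nvcc -g -G -o repro repro.cu
// Then, inside cuda-gdb:
//   (cuda-gdb) set cuda memcheck on
//   (cuda-gdb) run
//   (cuda-gdb) bt
```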

Yes, I have been using cuda-gdb. I have also tried with memcheck on, but it still does not break on illegal accesses most of the time. It is not a display GPU; it is a V100 used only for processing. I will try the coredumps and see if that helps.

What happens is that the program ends normally, with an error being returned by the sync function, but cuda-gdb does not break on it. Is that a known bug?

cuda-memcheck (run by itself on the executable) can isolate the exact line of code where the error is occurring. Have you been able to accomplish that? That usually helps most people quite a bit.

What is the actual error being returned by the sync function?

There are some restrictions for using the standalone cuda-memcheck on this particular application.

The sync function returns an illegal access:

“an illegal memory access was encountered”

Which makes sense. I managed to isolate some of these via log prints and fix them, but doing it with gdb would be much more efficient.
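For reference, the log-print approach can be made a bit more systematic by synchronizing after every launch in debug builds, so the error surfaces at the host line right after the offending kernel. This is only a sketch of that idea, not a replacement for a device backtrace: `checkCuda`, `CHECK_CUDA`, `CHECK_LAUNCH`, and the `fill` kernel are hypothetical names, and the null pointer stands in for whatever bad address the real code produces.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print the failing call with its host file and line.
// Returns true when the call succeeded.
static bool checkCuda(cudaError_t err, const char *what, const char *file, int line) {
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "%s:%d: %s failed: %s\n", file, line, what,
                 cudaGetErrorString(err));
    return false;
}
#define CHECK_CUDA(call) checkCuda((call), #call, __FILE__, __LINE__)

// Debug-only check after a launch: first launch errors, then execution errors.
// The synchronization serializes the GPU, so compile it out of release builds.
#define CHECK_LAUNCH() \
    (CHECK_CUDA(cudaGetLastError()) && CHECK_CUDA(cudaDeviceSynchronize()))

__global__ void fill(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main() {
    const int n = 1024;
    int *good = nullptr;
    if (!CHECK_CUDA(cudaMalloc(&good, n * sizeof(int)))) return 1;

    fill<<<(n + 255) / 256, 256>>>(good, n, 1);
    if (!CHECK_LAUNCH()) return 1;

    int *bad = nullptr;  // stand-in for a stale or miscomputed device pointer
    fill<<<(n + 255) / 256, 256>>>(bad, n, 2);
    // The illegal access should be reported by this check. The error is
    // sticky, so the program exits instead of issuing further CUDA calls.
    if (!CHECK_LAUNCH()) return 1;

    CHECK_CUDA(cudaFree(good));
    return 0;
}
```

Setting the environment variable CUDA_LAUNCH_BLOCKING=1 has a similar effect without code changes, since it makes kernel launches synchronous, but the explicit check also records which host line detected the failure.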