What techniques do you use to debug your kernels?
Emulation mode in CUDA is a handy tool for debugging some out-of-bounds reads/writes,
but it does not illustrate thread interactions very well (especially when shared memory and __syncthreads are used a lot).
Now I have a (common) situation where my kernel runs perfectly in emulation mode
and just does not run in ‘normal’ mode. The kernel is quite large and complex.
Is there any way to know at which instruction the kernel execution failed? Any chance of something like a hardware breakpoint, or some useful message from the CUDA runtime explaining why the execution failed (out-of-bounds memory access, device deadlock, watchdog killing the kernel, etc.)?
The only way I see for now to debug the kernel is trial and error: commenting out blocks of code until something works, then adding code back bit by bit to see what happens. It's a huge waste of time.
What techniques do you use in such situations?
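On the runtime-message part of the question: as far as I know, the closest thing is the error status the runtime reports after a launch. Below is a minimal sketch of querying it, not a definitive recipe. The helper `reportCuda` and the kernel `scaleKernel` are hypothetical names standing in for real code, and `cudaDeviceSynchronize` is assumed to be available (older toolkits used `cudaThreadSynchronize` for the same purpose). The status names the class of failure only; it does not point at the failing instruction.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the runtime's message and returns false on error.
static bool reportCuda(cudaError_t err, const char* what)
{
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}

// Placeholder kernel standing in for the real, much larger one.
__global__ void scaleKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // bounds check guards the last partial block
}

int main()
{
    const int n = 1000;
    float* d_data = NULL;
    if (!reportCuda(cudaMalloc((void**)&d_data, n * sizeof(float)), "cudaMalloc")) return 1;
    if (!reportCuda(cudaMemset(d_data, 0, n * sizeof(float)), "cudaMemset")) return 1;

    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // Launch-time problems (e.g. an invalid configuration) are reported here.
    bool ok = reportCuda(cudaGetLastError(), "kernel launch");

    // Execution-time problems (e.g. a bad memory access or a watchdog timeout)
    // are only reported once the host waits for the kernel to finish.
    ok = reportCuda(cudaDeviceSynchronize(), "kernel execution") && ok;

    cudaFree(d_data);
    return ok ? 0 : 1;
}
```

Checking both calls separately at least distinguishes "the kernel never started" from "the kernel started and then died", which narrows down where to start commenting code out.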