Any tips for debugging asynchronous kernel launches and memcpy?

I have some code that behaves differently depending on whether CUDA_LAUNCH_BLOCKING=1 is set: different STDOUT output is produced.

I see this difference with multiple streams on a single GPU.

Running the code under racecheck with CUDA_LAUNCH_BLOCKING=0 produces the same output as running it without racecheck but with CUDA_LAUNCH_BLOCKING=1, and racecheck reports no errors.

I’m guessing that somewhere I forgot to wait for an event, or record it, but where?

I tried reducing the code down to nothing one line at a time, hoping that the removal that makes it cross from apparently buggy to apparently non-buggy (w.r.t. the non-deterministic output) would tell me something, but it didn't.

Any tips or suggestions would be greatly appreciated. Most of the tutorials I found deal with debugging CUDA kernel execution, rather than asynchronous launches and memcpy.

I don’t know how complex the code is, but if you can reproduce the difference with a fairly short sequence of stream activity, you could run both cases through the visual profiler and look at the difference in the way things stack up on the timeline. That may give you a clue.

I’m guessing the code is complex, however, or you probably would have just dropped a cudaDeviceSynchronize() after each kernel individually to see the effect (making individual kernel launches blocking one at a time, instead of all of them at once).
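For reference, something along these lines (a minimal sketch only; kernelA/kernelB and the single-stream setup are placeholders, not your code):

```cpp
#include <cuda_runtime.h>

// kernelA / kernelB stand in for whatever kernels the real code launches.
__global__ void kernelA(float* data) { data[threadIdx.x] += 1.0f; }
__global__ void kernelB(float* data) { data[threadIdx.x] *= 2.0f; }

int main() {
    float* d;
    cudaMalloc(&d, 32 * sizeof(float));
    cudaMemset(d, 0, 32 * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    kernelA<<<1, 32, 0, s>>>(d);
    cudaDeviceSynchronize();     // debug only: makes this one launch behave as
                                 // if it were blocking; enable these one at a
                                 // time to localize which launch matters

    kernelB<<<1, 32, 0, s>>>(d);
    // cudaDeviceSynchronize();  // next candidate once the first is ruled out

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```

The idea is that whichever synchronize flips the output back to the CUDA_LAUNCH_BLOCKING=1 behaviour tells you which launch the problem hides behind.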

i normally ensure that all individual kernels and tasks execute properly by stepping through each of them with the debugger, so that i then know that any further error is likely inter-task - not within tasks, but between tasks

thereafter, i would insert breakpoints as debug pins at strategic positions, and note the output
you generally know what each kernel/task should do, and can thus check whether the output you get at a point seems correct or incorrect
based on this, one can determine whether the error is likely upstream or downstream, and where it resides

and as i have by now realized, it is indeed very possible to code stream races - i have managed this in the past, and from time to time still manage to do so
hence, i would also check my code from the vantage point of stream races caused by poor stream synchronization
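something like this minimal sketch (the producer/consumer kernels are made up for illustration; the missing cudaEventRecord / cudaStreamWaitEvent pair is the usual culprit):

```cpp
#include <cuda_runtime.h>

__global__ void producer(float* buf) { buf[threadIdx.x] = threadIdx.x; }
__global__ void consumer(const float* buf, float* out) {
    out[threadIdx.x] = buf[threadIdx.x] * 2.0f;
}

int main() {
    float *buf, *out;
    cudaMalloc(&buf, 32 * sizeof(float));
    cudaMalloc(&out, 32 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    cudaEvent_t produced;
    cudaEventCreate(&produced);

    producer<<<1, 32, 0, s1>>>(buf);

    // without the two lines below, the consumer may start before the producer
    // has finished - exactly the kind of race that only shows up when the
    // launches are truly asynchronous (CUDA_LAUNCH_BLOCKING=0)
    cudaEventRecord(produced, s1);
    cudaStreamWaitEvent(s2, produced, 0);

    consumer<<<1, 32, 0, s2>>>(buf, out);

    cudaDeviceSynchronize();
    cudaEventDestroy(produced);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(buf);
    cudaFree(out);
    return 0;
}
```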

i do not know of a tool that can point out stream races, and suppose that it would be rather complex to code…

another trick is to code in conditions that you can break on with a breakpoint
asynchronous execution makes it difficult to track certain errors, but if you can manage to break on the first instance where a value, variable, outcome or case misbehaves, you are generally in a position to spot the location this occurs at, to note the values of accomplice variables at that point, and to back-trace from there
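a minimal sketch of such a debug pin (the "result < 0" condition is just a stand-in for whatever counts as misbehaving in your code; the all-zero input deliberately trips it so there is something to break on):

```cpp
#include <assert.h>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void compute(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float result = data[i] * data[i] - 1.0f;   // placeholder computation

    // debug pin: fires on the first bad value, identifying which thread and
    // which input produced it; set a breakpoint here in cuda-gdb, or let the
    // device-side assert stop the kernel
    if (result < 0.0f) {
        printf("bad value %f from input %f at index %d\n", result, data[i], i);
        assert(result >= 0.0f);
    }

    data[i] = result;
}

int main() {
    const int n = 64;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));   // zeros trip the check: 0*0 - 1 < 0
    compute<<<1, n>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```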

Thanks!

I managed to find one bug, although I think there are more (of the kind that change -0.0 to +0.0 depending on CUDA_LAUNCH_BLOCKING)

yes, the flying dutchman

sounds like a float or double

i remember a case where a value was not initialized as a negative value, because the negative value was not put in brackets

and others…

Atomics! It totally slipped my mind that I was using them, and that they are non-deterministic with floating-point.
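For anyone who finds this later, a minimal sketch of the effect (the values are made up; the point is that atomicAdd on floats gives an order-dependent result, since float addition is not associative, so a sum near zero can land on -0.0 or +0.0 from run to run):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sumKernel(const float* vals, float* sum, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(sum, vals[i]);   // summation order varies between runs
}

int main() {
    const int n = 1 << 20;
    float* h = new float[n];
    // a sum that is exactly zero in real arithmetic, but rounds differently
    // depending on the order of accumulation
    for (int i = 0; i < n; ++i)
        h[i] = (i % 2 == 0) ? 1e-8f + i * 1e-12f : -(1e-8f + (i - 1) * 1e-12f);

    float *d_vals, *d_sum;
    cudaMalloc(&d_vals, n * sizeof(float));
    cudaMalloc(&d_sum, sizeof(float));
    cudaMemcpy(d_vals, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_sum, 0, sizeof(float));

    sumKernel<<<(n + 255) / 256, 256>>>(d_vals, d_sum, n);

    float result;
    cudaMemcpy(&result, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %g (may differ between runs)\n", result);

    cudaFree(d_vals);
    cudaFree(d_sum);
    delete[] h;
    return 0;
}
```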