I have some code that behaves differently depending on whether CUDA_LAUNCH_BLOCKING=1 is set: it prints different output to stdout.
I see this difference with multiple streams on a single GPU.
When I run the code under racecheck with CUDA_LAUNCH_BLOCKING=0, the output matches what I get without racecheck when CUDA_LAUNCH_BLOCKING=1 is set, and racecheck reports no errors.
I’m guessing that somewhere I forgot to record an event, or to wait on one, but where?
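To illustrate the kind of mistake I suspect, here is a minimal sketch of a two-stream producer/consumer dependency. This is not my actual code: the kernels `produce` and `consume`, the helper `run_pipeline`, and the buffer names are all hypothetical, and error checking is left out for brevity. If the `cudaStreamWaitEvent` call were missing, or the event were recorded before the producer launch, the consumer could run before its input is ready, while CUDA_LAUNCH_BLOCKING=1 would hide the problem by serializing the launches.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical producer kernel: fills buf[i] with i.
__global__ void produce(int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i;
}

// Hypothetical consumer kernel: reads the producer's output.
__global__ void consume(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1;
}

// Runs produce on stream s1 and consume on stream s2, ordered by an event.
// Returns the last element of the consumer's output.
int run_pipeline(int n) {
    int *a, *b, *h_last;
    cudaMalloc(&a, n * sizeof(int));
    cudaMalloc(&b, n * sizeof(int));
    cudaMallocHost(&h_last, sizeof(int));  // pinned, so the copy below is truly async

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEvent_t produced;
    cudaEventCreateWithFlags(&produced, cudaEventDisableTiming);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    produce<<<blocks, threads, 0, s1>>>(a, n);
    cudaEventRecord(produced, s1);         // record AFTER the producer launch
    cudaStreamWaitEvent(s2, produced, 0);  // without this, consume may race with produce
    consume<<<blocks, threads, 0, s2>>>(a, b, n);

    cudaMemcpyAsync(h_last, b + n - 1, sizeof(int),
                    cudaMemcpyDeviceToHost, s2);
    cudaStreamSynchronize(s2);             // host must wait before reading h_last
    int result = *h_last;

    cudaEventDestroy(produced);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFreeHost(h_last);
    cudaFree(a);
    cudaFree(b);
    return result;
}

int main() {
    printf("%d\n", run_pipeline(1 << 20));
    return 0;
}
```

In my real code the dependencies are spread across many more launches and copies, so a missing record/wait pair like this is hard to spot by inspection.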
I tried shrinking the code one line at a time, hoping that the last removal that takes it from apparently buggy (non-deterministic output) to apparently correct would point to the problem, but it didn’t.
Any tips or suggestions would be greatly appreciated. Most of the tutorials I found deal with debugging CUDA kernel execution, rather than asynchronous launches and memcpy.