I have a compute code that produces NaNs when compiled normally and works correctly when compiled with the -G option. What does the -G option do that could cause that kind of behavior? I’m guessing this is an issue with mixing the kernels with the cusolver calls, but compute-sanitizer reports no errors with any of its tools. I’m stumped as to what to look for. Code is too large to share, unfortunately.
See here. It is intended to generate device-debuggable code. Most optimizations that the compiler might do are disabled. Furthermore, additional symbol information (similar to -lineinfo) is included in the fatbinary.
Device machine code generation ends up being quite different, in my experience. A kernel compiled with -G will usually have noticeably more machine instructions in it than the same one without.
It’s difficult to say exactly what the -G option “does” that could cause that behavior, beyond what you see above. There are a few possible cases:
- There is a bug in your code, and optimization tends to make it more visible.
- There is a bug in the compiler that manifests during optimization.
It’s impossible to say which with no code.
You might try updating to the latest toolchain, if you’re not already there. Bugs get fixed all the time. Also, NaNs are fairly easy to detect. If you have a sequence of calculations, with data flowing from one step to the next, it’s not difficult to check for NaN between steps, although it typically requires additional debug code. Identifying the step in the sequence that converts “ordinary” data to NaN may be a useful debug or localization strategy.
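As a rough illustration of that kind of debug code, here is a minimal sketch that counts NaNs in a device array between steps. The names `count_nans`, `check_nans`, `fill_ones` and `poison_one` are made up for this example, and it assumes the data is a plain array of `double`; adapt it to your own layout and add error checking.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Counts NaN entries in a device array (grid-stride loop, one atomic per NaN).
__global__ void count_nans(const double* data, size_t n, unsigned int* count)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        if (isnan(data[i])) atomicAdd(count, 1u);
}

// Hypothetical helper: runs the check on `stream`, waits for it, and reports
// the number of NaNs found under the given label.
unsigned int check_nans(const char* label, const double* d_data, size_t n,
                        cudaStream_t stream)
{
    unsigned int* d_count = nullptr;
    unsigned int h_count = 0;
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemsetAsync(d_count, 0, sizeof(unsigned int), stream);
    count_nans<<<256, 256, 0, stream>>>(d_data, n, d_count);
    cudaMemcpyAsync(&h_count, d_count, sizeof(unsigned int),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    cudaFree(d_count);
    printf("[%s] NaN count: %u\n", label, h_count);
    return h_count;
}

// Stand-ins for two steps of a real pipeline (illustrative only).
__global__ void fill_ones(double* x, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] = 1.0;
}

__global__ void poison_one(double* x)
{
    x[0] = sqrt(-1.0);  // deliberately produces a NaN
}

int main()
{
    const size_t n = 1 << 20;
    double* d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(double));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    fill_ones<<<(unsigned int)((n + 255) / 256), 256, 0, stream>>>(d_x, n);
    check_nans("after step 1", d_x, n, stream);

    poison_one<<<1, 1, 0, stream>>>(d_x);
    check_nans("after step 2", d_x, n, stream);

    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}
```

One caveat: the helper synchronizes the stream it runs on, so inserting it changes timing. If the NaNs disappear once the checks are in place, that in itself hints at a missing synchronization somewhere rather than a numerical problem.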
Thanks for the info. Turned out to be a race condition between two streams where there should have been only one.
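For anyone landing here later, a generic sketch of this kind of race and two ways to remove it is below. It is not the original code: the kernels `produce` and `consume` and the buffer names are invented for illustration. Debug builds run slower, so a change in timing could plausibly hide a race like this, which would fit the -G behavior described above, though that is a guess.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative producer: writes the buffer that the consumer reads.
__global__ void produce(double* x, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] = 1.0;
}

// Illustrative consumer: reads x, so it must not start before produce is done.
__global__ void consume(const double* x, double* y, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) y[i] = 2.0 * x[i];
}

int main()
{
    const size_t n = 1 << 20;
    const unsigned int block = 256;
    const unsigned int grid = (unsigned int)((n + block - 1) / block);

    double *x = nullptr, *y = nullptr;
    cudaMalloc(&x, n * sizeof(double));
    cudaMalloc(&y, n * sizeof(double));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Racy pattern (left commented out): nothing orders s2 after s1, so
    // consume may read x before produce has written it.
    //   produce<<<grid, block, 0, s1>>>(x, n);
    //   consume<<<grid, block, 0, s2>>>(x, y, n);

    // Fix A: issue dependent work into the same stream; it runs in order.
    produce<<<grid, block, 0, s1>>>(x, n);
    consume<<<grid, block, 0, s1>>>(x, y, n);
    cudaStreamSynchronize(s1);

    // Fix B: keep two streams, but make s2 wait on an event recorded in s1.
    cudaEvent_t produced;
    cudaEventCreateWithFlags(&produced, cudaEventDisableTiming);
    produce<<<grid, block, 0, s1>>>(x, n);
    cudaEventRecord(produced, s1);
    cudaStreamWaitEvent(s2, produced, 0);
    consume<<<grid, block, 0, s2>>>(x, y, n);
    cudaStreamSynchronize(s2);

    double y0 = 0.0;
    cudaMemcpy(&y0, y, sizeof(double), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", y0);

    cudaEventDestroy(produced);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

When library calls are mixed in, the library handle's stream matters too. For cuSOLVER dense handles it is set with `cusolverDnSetStream`, so it is worth checking that the handle and the surrounding kernels use the stream you intend.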