Undefined and NaN results


I have a kernel that uses code I copied directly from a CPU-only version. The two versions give identical results, except that something is causing the CUDA version to produce NaN in some spots. I looked through the code, and of course there are ordinary operators. The only other things are pow(), abs(), and math operations of that sort. I am also using the cuComplex library for working with complex numbers, but it is used in the CPU version as well.

Are there any operations I should look out for that could be particularly troublesome? Also, is there any way in cuda-gdb to set a breakpoint for a particular array index, so that I could inspect the values when that index changes?
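For the second question, cuda-gdb supports GDB-style conditional breakpoints, so one option is to break at the line that writes the array and condition on the thread's index variable. A sketch (the kernel file name, line number, and variable `idx` are placeholders for your own code):

```
(cuda-gdb) break mykernel.cu:42 if idx == 1000
(cuda-gdb) run
(cuda-gdb) print b
```

Hardware watchpoints on device memory are not generally available, so a conditional breakpoint on the writing statement is usually the closest substitute.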

Are you running on an SM_2X device? Ocelot (http://code.google.com/p/gpuocelot/) should faithfully implement the unordered-number behavior of an actual Fermi device. What I usually do is run the emulator over an application that produces the NaN, record the entire instruction trace, search for the first instruction that generated it, then look up the corresponding source line.

Even though I use Ocelot myself, I’ve never used it to try to track down NaN instructions like that… makes sense!

That would be a great demo tutorial for Ocelot users, too…

Thanks for the suggestion, I’ll keep it in mind. I am trying to put together some slides for a tutorial…

It might also be a good candidate for a new correctness checking tool. It would be fairly simple to watch the instruction stream and raise an error and print out the source line on the first instruction to generate a NaN. We could also do it as an instrumentation pass and run it on the device, although it would be some extra work.

Well, I found out it was a normal float being set to NaN, presumably from a division by zero. I never could figure out the exact cause, because the debugger always said my variables were out of scope, but I was able to guard that float in the kernel with if(b != b), and that seems to have fixed the problem.

Seems weird that the exact same code produces correct results when running on the CPU.

Some operations behave differently with regard to floating-point ULP error or the treatment of unordered and subnormal numbers in PTX. Most of the time the differences come from exp, log, sin, cos, and the div.approx functions (which have poor accuracy), or from the ftz modifier being applied to many instructions (rather than dividing by a very small number, you end up dividing by 0). There are many different floating-point modifiers available for most instructions, and if you tweak them one instruction at a time you can usually get the behaviour to match the CPU. Most CPUs also let you control these modes (fesetround/fegetround/_controlfp). When there is a difference, in my experience it is usually because the modifiers that were selected do not match up between the two targets.