Sqrt of positive number is -nan with newer drivers

I’m seeing an issue where the sqrt of a large positive number is -nan in some driver versions. The issue can be duplicated by calling the function below from an example in the OptiX SDK (I’m testing in optixPathTracer). The code works correctly in older drivers (tested with 472.84). I assume the issue is the driver’s PTX compilation substituting a single precision sqrt for the double precision version.
I generally avoid using doubles with CUDA/OptiX, but a few specific calculations really do need them. I might be able to work around this particular case with some scaling hacks, but is there some other method of invoking double precision sqrt that might dodge this substitution issue?
Thanks

__device__ __forceinline__ void sanityCheckSqrt()
{
  // Using volatile because the input is hardcoded; the problem also occurs without volatile when the input is computed at runtime
  volatile double x = 1.0e40;
  x = sqrt(x);
  printf("sqrt(x) %f\n", x); // Prints -nan
}

Current test configuration:
Windows 10 21H2
dual Quadro RTX 4000
512.15 driver
CUDA 11.1 and OptiX 7.3
Problem also occurs with CUDA 10.0 and OptiX 6.5

The OptiX SDK examples usually compile the CUDA code to PTX with the flag --use_fast_math for performance reasons.
That replaces trigonometric, reciprocal, and square root calculations with approximated instructions.

I’m actually not sure whether that setting also replaces double with float calculations. I’m not using doubles in OptiX device code because they usually incur a hefty performance hit, except on a few workstation GPU architectures.

See the NVCC help (this was from 10.1) for which flags are affected (mind the --prec-sqrt=false):

--use_fast_math (-use_fast_math)  Make use of fast math library.  
  '--use_fast_math' implies '--ftz=true --prec-div=false  --prec-sqrt=false --fmad=true'.

Please check what your NVCC command line options are exactly.

If you’re using --use_fast_math, compare the resulting PTX with and without it and look for instructions with the .approx suffix in the fast-math case (illustrated below).
In particular, compare the code size of the PTX results. Note that the precise calculation of trigonometric functions can result in considerably bigger code and a big impact on runtime performance.
Check whether your sqrt instruction is using doubles inside the PTX code.
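For reference, this is the kind of difference to look for in the PTX (illustrative register names, not from an actual dump):

sqrt.rn.f32     %f2, %f1; // precise IEEE single precision sqrt (the default)
sqrt.approx.f32 %f2, %f1; // approximate sqrt emitted under --use_fast_math

As far as I can tell, a double precision sqrt stays sqrt.rn.f64 either way, since the PTX ISA defines no sqrt.approx.f64.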

You might want to change everything to the fast math mode except for the sqrt.
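One hedged way to get close to that, sketched from the flags quoted above: skip --use_fast_math and set the flags it implies individually, leaving --prec-sqrt at its default of true:

nvcc --ftz=true --prec-div=false --fmad=true --prec-sqrt=true ...

Note this does not replicate everything: --use_fast_math also maps calls like sinf() to the fast intrinsic versions, which the individual flags do not do.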

I double checked the flags, and we don’t use fast math in our application. It’s a scientific application and we accept the performance impact. I don’t think it affects this issue since the behavior is the same between our application and the SDK examples.
The PTX is sqrt.rn.f64

My understanding is that OptiX compilation (which includes an internal PTX → LLVM conversion and related optimization passes) applies transformations resembling the effects of --ffast-math in C/C++. In other words, removing this flag from nvcc is not enough, because nvcc only covers a small part of the larger compilation pipeline.

It would be very useful to have an OptixModuleCompileOptions::disableFastMath flag in some future API version to prevent undesirable optimizations in a scientific context.
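To illustrate where such a flag could live (optLevel and debugLevel exist in the current API; disableFastMath is the proposed, purely hypothetical field):

OptixModuleCompileOptions mco = {};
mco.optLevel   = OPTIX_COMPILE_OPTIMIZATION_LEVEL_DEFAULT; // existing field
mco.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_DEFAULT;        // existing field
// mco.disableFastMath = 1;  // hypothetical, not part of any released OptiX API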

It seems rsqrt stays 64-bit in these drivers.
x = 1.0 / rsqrt( x ); works

Note that, according to the PTX ISA, the 64-bit rsqrt instruction is always a .approx instruction and is always emulated in software [1]. Maybe that’s plenty of precision, but check to make sure, and also look at the PTX & SASS to see what ultimately happens with the reciprocal as well.
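A quick way to eyeball the precision, as a sketch (the reference value is sqrt(1.0e40) = 1.0e20 worked out by hand, since sqrt() itself is the affected path here):

__device__ __forceinline__ void checkRsqrtPrecision()
{
  volatile double x = 1.0e40;
  const double viaRsqrt  = 1.0 / rsqrt( x );  // the proposed workaround
  const double reference = 1.0e20;            // exact sqrt(1.0e40)
  printf( "1/rsqrt(x) = %.17g, relative error = %g\n",
          viaRsqrt, fabs( viaRsqrt - reference ) / reference );
}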

Not sure if this is helpful here or not, but wanted to note that you can exercise control and mix fast math and accurate (slower) math if you use the double-underscore device intrinsics explicitly for code you want to force a certain way regardless of what the compiler does.
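For example, a minimal sketch (both intrinsics are documented in the CUDA math API; note that __dsqrt_rn should lower to the same sqrt.rn.f64 instruction, so whether it dodges the driver issue here would need testing):

// Precise IEEE double sqrt regardless of the fast math flags:
__device__ __forceinline__ double preciseSqrt( double x )
{
  return __dsqrt_rn( x );
}

// Fast approximate single precision sine regardless of the flags:
__device__ __forceinline__ float fastSin( float x )
{
  return __sinf( x );
}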

The PTX ISA also lists the error bounds of the .approx versions of math instructions, in case that’s useful. The bounds are sometimes better than people expect, given the name “approx”.

[1] PTX ISA: rsqrt

“Note that rsqrt.approx.f64 is emulated in software and is relatively slow.”
“For PTX ISA version 1.4 and later, the .approx modifier is required [for rsqrt].”


David.

rsqrt being approximate is noted, and I need to do more testing on the precision impacts. But the answers it returns are much more accurate than -nan.
The issue with sqrt.rn.f64 being apparently treated as single precision persists in the 516.25 driver.

Have you checked whether this looks like a CUDA bug or an OptiX bug? Do you get the right 64 bit behavior if you put the code in one of the CUDA SDK samples?

Also, I’m not yet exactly sure where it’s going wrong. Since 1e40 is larger than FLT_MAX, I’m curious whether the problem is a cast to float before the PTX sqrt instruction. A -nan result would mean the value the sqrt actually sees has a 1 (negative) sign bit. Maybe you can inspect the bits before and after more carefully than the %f print allows, or even experiment and see whether the problem is that the sqrt is operating on the lower 32 bits of your double?
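Something along these lines would show the exact bits (a sketch using the CUDA intrinsic __double_as_longlong; the helper name is made up):

__device__ __forceinline__ void sanityCheckSqrtBits()
{
  volatile double x = 1.0e40;
  printf( "in:  0x%016llx (%g)\n", (unsigned long long) __double_as_longlong( x ), (double) x );
  double r = sqrt( x );
  printf( "out: 0x%016llx (%g)\n", (unsigned long long) __double_as_longlong( r ), r );
}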


David.

Testing with the CUDA 11.1 examples (specifically simplePrintf and ptxjit), I do see the correct 64 bit results.
Testing with 1e36 (in OptiX) produces the correct answer, which seems to imply it’s a downcast to float rather than using half the bits of the double.
Here’s the PTX snippet for a bit more context, though it seems the bug is introduced later in compilation.
mov.u64 %rd135, 5205425776111082661; // 0x483D6329F1C35CA5, the bit pattern of 1.0e40
st.local.u64 [%rd129], %rd135;       // store to the volatile local x
ld.local.f64 %fd26, [%rd129];        // reload x as a double
sqrt.rn.f64 %fd27, %fd26;            // full precision double sqrt, as expected

I was able to repro this in the OptiX 7.3 SDK with a 515 driver. However, testing in OptiX 7.4 and 7.5 SDKs, I see the correct result for sqrt() with both PTX and OPTIX-IR compilation. So it does appear that this is an OptiX bug, and that it was already fixed along the way somewhere. Sorry about that. Can you try again with OptiX 7.5? FWIW I doubt it matters but I’m also using CUDA 11.7.

BTW, we will look into fixing the bug in 7.3 and earlier. Hopefully the workaround can get you past this immediately, but if not, let us know and we’ll try to prioritize accordingly.


David.

I’m still seeing the -nan with OptiX 7.5. So it looks like the conditions involved in triggering the issue are more complicated than I’d been thinking. Is there any apparent difference in the PTX between your 7.3 and 7.5 cases that might explain the behavior change?

Tested with:
Windows 10 21H2
dual Quadro RTX 4000
516.25 driver
OptiX 7.5 with CUDA 11.1 and Visual Studio 2019
OptiX 7.5 with CUDA 11.7 and Visual Studio 2019
OptiX 7.5 with CUDA 11.7 and Visual Studio 2022

All my previous tests have been Release builds. The Debug executable for the OptiX 7.5 / CUDA 11.7 / Visual Studio 2022 configuration computes the sqrt correctly. The Release executable still gives -nan.

Okay, interesting. This might indeed be more complicated. I’ll try again with 7.5.

I did see a couple of cases that were confusing to me, and it seemed like I was reproing with 7.5. However, I started using a more careful workflow to repro, where I did two things: first, delete my build directory, re-run cmake from scratch, and build from scratch; second, delete my OptiX cache every time as well. This ensures that the executable, the PTX, and the SASS are all recreated under identical conditions and guarantees that I’m not accidentally picking up something from a previous run.
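As an alternative to deleting the cache directory by hand, the disk cache can also be switched off programmatically for a test run; a one-line sketch, assuming an existing OptixDeviceContext named context:

// Disable the OptiX disk cache so every module is recompiled from scratch:
optixDeviceContextSetCacheEnabled( context, 0 );

If I remember correctly, the OPTIX_CACHE_PATH environment variable can also redirect the cache to a scratch location.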

Is it worth double-checking your 7.5 results with a clean and total rebuild, or did you already do that?

Either way, I have filed a bug report that the team will be looking into, and I’ll update the description, repro, and priority if the bug turns out to have no current fix.


David.

Should I be paranoid about the OptiX cache now? It has seemed well behaved in the past.
Retrying the OptiX 7.5 Release/Debug experiment with new build areas and an empty cache each time produced the same result. Debug works, Release does not.

There’s no reason to be paranoid; the cache is very reliable. It’s me who’s less reliable: sometimes I don’t know or don’t keep track of what’s changed, so cleaning the cache eliminates the possibility that I made a mistake or forgot something. And, of course, reproducing with a clean cache shows that caching is not the problem. This bug has been picked up and the team is working on it. We can reproduce it, and it appears I was wrong about this being tied to the SDK version number, so I did indeed make a mistake somewhere. Thank you for reporting this issue; we’ll get it fixed and released ASAP. We’ll also take Wenzel’s fast-math suggestions under advisement.


David.

A minor update here is that the compiler team has found and fixed the sqrt bug. Thank you for reporting it! It is limited in practice to OptiX programs.

How important is it to you that this gets fixed soon, and do you have an acceptable workaround in the meantime? Does your rsqrt() trick work, and/or can you do your double precision work via CUDA? Sorry for even asking; we want the fix released ASAP, of course, and normally I wouldn’t ask, but there are some branch scheduling issues that could make getting this out in the current driver branch difficult, and the next major workstation branch is a couple of months out. If the damage is already done and a month or two of extra delay isn’t very relevant at this point, we could let the fix work its way into the next major workstation driver update. But if this continues to impact you, then we will try to do the work to get it into 515.


David.

The rsqrt method seems to be good enough for my purposes, and I’m planning to stick with it for some time to work around the issue in existing drivers. So I’d say the fix can wait for the regular release schedule.
Thanks
