While tracking down some curious numerical differences in some code of mine, I came across an instance of questionable, not to say simply incorrect, constant propagation in PTXAS. I am using CUDA 8, as I am on a Pascal platform and have not had a need to upgrade yet. If someone could try the test code below with CUDA 10, I would be much obliged.
When propagating a constant through rsqrt.approx.ftz.f64, PTXAS seems to substitute the full-precision result rather than the truncated (and far less accurate) result that MUFU.RSQ64H actually returns. Interestingly enough, this incorrect constant propagation does not happen with rcp.approx.ftz.f64. In the example below I pass an argument of 100.0 to rsqrt.approx.ftz.f64 in two ways: as a kernel argument in kernel1(), and as a literal constant in kernel2(). The output (CUDA 8, Quadro P2000) is as follows:
kernel1: arg= 0x1.90000000000000p+6 res = 0x1.99999000000000p-4
kernel2: arg= 0x1.90000000000000p+6 res = 0x1.999999999999a0p-4
This is not a value-preserving optimization, and it is quite unexpected. My minimal test app is as follows:
Thanks for taking my app for a spin. So the problem is still there. I’ll try to file a bug later, once I have recovered from the frustration of spending almost three hours tracking the numerical differences I observed to their root cause.
I note that the issue occurs for any PTXAS optimization level above -O0.