Questionable constant propagation in PXTAS?

I was trying to track down some curious differences in some code of mine and came across an instance of questionable, if not to say simply incorrect, constant propagation in PTXAS. I am using CUDA 8, as I am on a Pascal platform and have not had a need to upgrade yet. If someone could try the test code below with CUDA 10, I would be much obliged.

When propagating a constant through rsqrt.approx.ftz.f64, it seems PTXAS substitutes the full-precision result, rather than the truncated (and far less accurate) result MUFU.RSQ64 actually returns. Interestingly enough, this incorrect constant propagation does not happen with rcp.approx.ftz.f64. In the example below I am passing the instruction rsqrt.approx.ftz.f64 an argument of 100.0: from a kernel argument in kernel1(), and as a literal constant in kernel2(). The output (CUDA 8, Quadro P2000) is as follows:

kernel1: arg= 0x1.90000000000000p+6  res =  0x1.99999000000000p-4
kernel2: arg= 0x1.90000000000000p+6  res =  0x1.999999999999a0p-4

Not a value-preserving optimization and quite unexpected. My minimal test app is as follows:

#include <stdio.h>
#define TEST_VAL (100.0)
__global__ void kernel1 (double arg)
{
    double res;
    asm ("rsqrt.approx.ftz.f64 %0, %1;" : "=d"(res) : "d"(arg));
    printf ("kernel1: arg=%22.14a  res = %22.14a\n", arg, res);
}
__global__ void kernel2 (double arg)
{
    double res;
    arg = TEST_VAL;
    asm ("rsqrt.approx.ftz.f64 %0, %1;" : "=d"(res) : "d"(arg));
    printf ("kernel2: arg=%22.14a  res = %22.14a\n", arg, res);
}
int main (void)
{
    kernel1 <<<1,1>>>(TEST_VAL);
    kernel2 <<<1,1>>>(TEST_VAL);
    cudaDeviceSynchronize();
    return 0;
}

This is from CUDA 10.0.130, CentOS7, Tesla P100 (note that CUDA 10.1 is now available):

$ cuda-memcheck ./t405
========= CUDA-MEMCHECK
kernel1: arg= 0x1.90000000000000p+6  res =  0x1.99999000000000p-4
kernel2: arg= 0x1.90000000000000p+6  res =  0x1.999999999999a0p-4
========= ERROR SUMMARY: 0 errors
$ cuobjdump -sass ./t405

Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_60

Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_60
                Function : _Z7kernel2d
        .headerflags    @"EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)"
                                                                   /* 0x001fc400fe2007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;          /* 0x4c98078000870001 */
        /*0010*/                   IADD32I R1, R1, -0x10 ;         /* 0x1c0fffffff070101 */
        /*0018*/                   MOV32I R8, 0x0 ;                /* 0x010000000007f008 */
                                                                   /* 0x001fc800fe2007f1 */
        /*0028*/                   MOV32I R9, 0x40590000 ;         /* 0x010405900007f009 */
        /*0030*/                   MOV32I R10, 0x9999999a ;        /* 0x0109999999a7f00a */
        /*0038*/                   MOV32I R11, 0x3fb99999 ;        /* 0x0103fb999997f00b */
                                                                   /* 0x001f84001e2007f0 */
        /*0048*/         {         MOV32I R4, 0x0 ;                /* 0x010000000007f004 */
        /*0050*/                   STL.128 [R1], R8         }
                                                                   /* 0xef56000000070108 */
        /*0058*/                   IADD R6.CC, R1, c[0x0][0x4] ;   /* 0x4c10800000170106 */
                                                                   /* 0x003ff400fec007f5 */
        /*0068*/                   MOV32I R5, 0x0 ;                /* 0x010000000007f005 */
        /*0070*/                   IADD.X R7, RZ, c[0x0][0x104] ;  /* 0x4c1008000417ff07 */
        /*0078*/                   JCAL 0x0 ;                      /* 0xe220000000000040 */
                                                                   /* 0x001ffc00fc6007ef */
        /*0088*/                   NOP ;                           /* 0x50b0000000070f00 */
        /*0090*/                   NOP ;                           /* 0x50b0000000070f00 */
        /*0098*/                   EXIT ;                          /* 0xe30000000007000f */
                                                                   /* 0x001f8000fc0007ff */
        /*00a8*/                   BRA 0xa0 ;                      /* 0xe2400fffff07000f */
        /*00b0*/                   NOP;                            /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                            /* 0x50b0000000070f00 */
                ......................


                Function : _Z7kernel1d
        .headerflags    @"EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)"
                                                                   /* 0x001fc000fe4007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;          /* 0x4c98078000870001 */
        /*0010*/                   MOV R0, c[0x0][0x144] ;         /* 0x4c98078005170000 */
        /*0018*/         {         IADD32I R1, R1, -0x10 ;         /* 0x1c0fffffff070101 */
                                                                   /* 0x001fc400fe200711 */
        /*0028*/                   MUFU.RSQ64H R11, R0         }
                                                                   /* 0x508000000077000b */
        /*0030*/                   MOV R8, c[0x0][0x140] ;         /* 0x4c98078005070008 */
        /*0038*/                   MOV R9, c[0x0][0x144] ;         /* 0x4c98078005170009 */
                                                                   /* 0x001fc000ffa007f1 */
        /*0048*/                   MOV R10, RZ ;                   /* 0x5c9807800ff7000a */
        /*0050*/                   MOV32I R4, 0x0 ;                /* 0x010000000007f004 */
        /*0058*/         {         MOV32I R5, 0x0 ;                /* 0x010000000007f005 */
                                                                   /* 0x001fd800fec008f1 */
        /*0068*/                   STL.128 [R1], R8         }
                                                                   /* 0xef56000000070108 */
        /*0070*/                   IADD R6.CC, R1, c[0x0][0x4] ;   /* 0x4c10800000170106 */
        /*0078*/                   IADD.X R7, RZ, c[0x0][0x104] ;  /* 0x4c1008000417ff07 */
                                                                   /* 0x001ffc00fd800ffd */
        /*0088*/                   JCAL 0x0 ;                      /* 0xe220000000000040 */
        /*0090*/                   NOP ;                           /* 0x50b0000000070f00 */
        /*0098*/                   EXIT ;                          /* 0xe30000000007000f */
                                                                   /* 0x001f8000fc0007ff */
        /*00a8*/                   BRA 0xa0 ;                      /* 0xe2400fffff07000f */
        /*00b0*/                   NOP;                            /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                            /* 0x50b0000000070f00 */
                ......................



Fatbin ptx code:
================
arch = sm_60
code version = [6,3]
producer = cuda
host = linux
compile_size = 64bit
compressed
$

CUDA 10.0, GTX 1050:

kernel1: arg= 0x1.90000000000000p+6  res =  0x1.99999000000000p-4
kernel2: arg= 0x1.90000000000000p+6  res =  0x1.999999999999a0p-4

EDIT: As usual, Robert was faster.

Thanks for taking my app for a spin. So the problem is still there. I’ll try to file a bug later once I have recovered from my frustration of spending almost three hours tracking the numerical differences I observed to their root cause.

I note that the issue occurs for any PTXAS optimization level above -O0.

So this is what happens when one retires from NV?

I just filed bug 2533177 for this issue.

The following items have been modified for this bug: Status changed from “Open - Fix being tested” to “Closed - Fixed”