Bug in compiler constant folding when using mul.wide.u16

It seems the compiler’s constant folding gets confused when “mul.wide.u16” is used:

__global__ void testKernel(unsigned int *output)
{
	unsigned short al, ah;
	al = 0x1234;
	ah = 0x10;
	asm volatile ("mul.wide.u16    %0,  %1,  %2;\n\t" : "=r"(output[0]) : "h"(ah), "h"(al));
}

This kernel should return 0x12340, and when compiled in debug mode it does. However, in release mode, it returns 0x2340 - the number gets truncated when it shouldn’t.

Looking at the generated SASS code, the whole program gets compiled down to

/*0010*/                   MOV32I R4, 0x2340;      /* 0x1800008d00011de2 */

so it seems to be some bug with the constant folding.

Thanks for the report, I’ve filed a bug internally at NVIDIA. I confirmed the issue is present with CUDA 9.1

[I wrote and posted the following before I saw the response from txbob]

I can reproduce this with CUDA 8. For every PTXAS optimization level other than -O0, the generated code is broken. Since CUDA 9.1 is out, I would suggest trying that, and if the bug still exists, file a bug report with NVIDIA. The bug reporting form is linked from the registered developer website.

As a workaround, try compiling for an earlier architecture (I tried sm_30) , and then rely on JIT compilation to map that code to your actual GPU. Sadly, when I try that (JIT to sm_50), I still get the wrong result, which makes no sense whatsoever (how would the upper 16 bits for a MOV32I get lost in the JIT process?). Looks like NVIDIA really needs to step up their game in regard to PTXAS testing.

code for sm_30
                Function : _Z10testKernelPj
        .headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
                                                           /* 0x2002e04282004007 */
        /*0008*/                   MOV R1, c[0x0][0x44];   /* 0x2800400110005de4 */
        /*0010*/                   MOV R2, c[0x0][0x140];  /* 0x2800400500009de4 */
        /*0018*/                   MOV R3, c[0x0][0x144];  /* 0x280040051000dde4 */
        /*0020*/                   MOV32I R0, 0x12340;     /* 0x1800048d00001de2 */
        /*0028*/                   ST.E [R2], R0;          /* 0x9400000000201c85 */
        /*0030*/                   EXIT;                   /* 0x8000000000001de7 */
        /*0038*/                   BRA 0x38;               /* 0x4003ffffe0001de7 */
                .................................