It seems the compiler’s constant folding gets confused when “mul.wide.u16” is used:
__global__ void testKernel(unsigned int *output)
unsigned short al, ah;
al = 0x1234;
ah = 0x10;
asm volatile ("mul.wide.u16 %0, %1, %2;\n\t" : "=r"(output) : "h"(ah), "h"(al));
This kernel should return 0x12340, and when compiled in debug mode it does. However, in release mode, it returns 0x2340 - the number gets truncated when it shouldn’t.
Looking at the generated SASS code, the whole program gets compiled down to
/*0010*/ MOV32I R4, 0x2340; /* 0x1800008d00011de2 */
so it seems to be some bug with the constant folding.
Thanks for the report, I’ve filed a bug internally at NVIDIA. I confirmed the issue is present with CUDA 9.1
[I wrote and posted the following before I saw the response from txbob]
I can reproduce this with CUDA 8. For every PTXAS optimization level other than -O0, the generated code is broken. Since CUDA 9.1 is out, I would suggest trying that, and if the bug still exists, file a bug report with NVIDIA. The bug reporting form is linked from the registered developer website.
As a workaround, try compiling for an earlier architecture (I tried sm_30) , and then rely on JIT compilation to map that code to your actual GPU. Sadly, when I try that (JIT to sm_50), I still get the wrong result, which makes no sense whatsoever (how would the upper 16 bits for a MOV32I get lost in the JIT process?). Looks like NVIDIA really needs to step up their game in regard to PTXAS testing.
code for sm_30
Function : _Z10testKernelPj
.headerflags @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
/* 0x2002e04282004007 */
/*0008*/ MOV R1, c[0x0][0x44]; /* 0x2800400110005de4 */
/*0010*/ MOV R2, c[0x0][0x140]; /* 0x2800400500009de4 */
/*0018*/ MOV R3, c[0x0][0x144]; /* 0x280040051000dde4 */
/*0020*/ MOV32I R0, 0x12340; /* 0x1800048d00001de2 */
/*0028*/ ST.E [R2], R0; /* 0x9400000000201c85 */
/*0030*/ EXIT; /* 0x8000000000001de7 */
/*0038*/ BRA 0x38; /* 0x4003ffffe0001de7 */