It seems the compiler’s constant folding gets confused when “mul.wide.u16” is used:
__global__ void testKernel(unsigned int *output)
{
unsigned short al, ah;
al = 0x1234;
ah = 0x10;
asm volatile ("mul.wide.u16 %0, %1, %2;\n\t" : "=r"(output[0]) : "h"(ah), "h"(al));
}
This kernel should return 0x12340, and when compiled in debug mode it does. However, in release mode, it returns 0x2340 - the number gets truncated when it shouldn’t.
Looking at the generated SASS code, the whole program gets compiled down to
[I wrote and posted the following before I saw the response from txbob]
I can reproduce this with CUDA 8. For every PTXAS optimization level other than -O0, the generated code is broken. Since CUDA 9.1 is out, I would suggest trying that, and if the bug still exists, file a bug report with NVIDIA. The bug reporting form is linked from the registered developer website.
As a workaround, try compiling for an earlier architecture (I tried sm_30) , and then rely on JIT compilation to map that code to your actual GPU. Sadly, when I try that (JIT to sm_50), I still get the wrong result, which makes no sense whatsoever (how would the upper 16 bits for a MOV32I get lost in the JIT process?). Looks like NVIDIA really needs to step up their game in regard to PTXAS testing.