NVCC compiler issue, with POC

I made this source code to demonstrate an issue which affects the compile process of CUDA kernels.

  • All GPU architectures are affected
  • Tested on CUDA SDK 10.x and 9.x
  • Tested on Ubuntu 18.04 LTS and Windows 10

I created three artificial kernels. They all consist of 100% the same code.
The code is just one function call. All three kernels call the same function using the same parameters.

__global__ void x1 (int *in, int *out) { s (in, out); }
__global__ void x2 (int *in, int *out) { s (in, out); }
__global__ void x3 (int *in, int *out) { s (in, out); }

Two of three kernels will end up in the same bytecode as expected.
The third will create a different bytecode. This one will run at a reduced performance.
This is not a theoretical problem.
My real-world application suffers from this issue and runs at 15% reduced performance.

Steps to reproduce:

$ nvcc -arch=sm_75 -Xptxas="-v" 3k.cu 2>&1 | grep Used
ptxas info    : Used 70 registers, 368 bytes cmem[0]
ptxas info    : Used 53 registers, 368 bytes cmem[0]
ptxas info    : Used 53 registers, 368 bytes cmem[0]

Alternative:

$ nvcc -cubin -arch=sm_75 3k.cu
$ cuobjdump --dump-resource-usage 3k.cubin | grep REG:
  REG:70 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:368 TEXTURE:0 SURFACE:0 SAMPLER:0
  REG:53 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:368 TEXTURE:0 SURFACE:0 SAMPLER:0
  REG:53 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:368 TEXTURE:0 SURFACE:0 SAMPLER:0

The functions s() and t() are relevant to trigger the issue. Please do not try to see any sense in the code in the functions. They do not make sense anymore after I stripped them down as much as possible to simplify debugging for you.

The sourcecode I had to put here since the forum does not offer attaching files:

https://gist.github.com/jsteube/2e89e43bda98db61291bd07ed143cd55

If you want to see this issue in a real-world application:

$ git clone https://github.com/hashcat/hashcat
$ cd hashcat/
$ git checkout dbfd8d949e8afb5b5d4879636b7fac90bc598140
$ make -s
$ ./hashcat -b -m 5300 --mach
... 884,906,820 H/s ...
The kernel which is used was 'm05300_s04' in 'OpenCL/m05300_a3-optimized.cl'. Open the source and remove all code in kernel 'm05300_m04', 'm05300_m08', 'm05300_m16', 'm05300_s08' and 'm05300_s16'.
$ rm -rf kernels/
$ ./hashcat -b -m 5300 --mach
... 1,039,393,182 H/s ...

I suggest filing a bug. The instructions are linked to a sticky post at the top of this forum.

Done: https://developer.nvidia.com/nvidia_bug/2851145