NVCC compiler issue, with POC

atomu · February 6, 2020, 10:19am

I made this source code to demonstrate an issue which affects the compile process of CUDA kernels.

All GPU architectures are affected
Tested on CUDA SDK 10.x and 9.x
Tested on Ubuntu 18.04 LTS and Windows 10

I created three artificial kernels. They all consist of 100% the same code.
The code is just one function call. All three kernels call the same function using the same parameters.

__global__ void x1 (int *in, int *out) { s (in, out); }
__global__ void x2 (int *in, int *out) { s (in, out); }
__global__ void x3 (int *in, int *out) { s (in, out); }

Two of three kernels will end up in the same bytecode as expected.
The third will create a different bytecode. This one will run at a reduced performance.
This is not a theoretical problem.
My real-world application suffers from this issue and runs at 15% reduced performance.

Steps to reproduce:

$ nvcc -arch=sm_75 -Xptxas="-v" 3k.cu 2>&1 | grep Used
ptxas info    : Used 70 registers, 368 bytes cmem[0]
ptxas info    : Used 53 registers, 368 bytes cmem[0]
ptxas info    : Used 53 registers, 368 bytes cmem[0]

Alternative:

$ nvcc -cubin -arch=sm_75 3k.cu
$ cuobjdump --dump-resource-usage 3k.cubin | grep REG:
  REG:70 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:368 TEXTURE:0 SURFACE:0 SAMPLER:0
  REG:53 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:368 TEXTURE:0 SURFACE:0 SAMPLER:0
  REG:53 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:368 TEXTURE:0 SURFACE:0 SAMPLER:0

The functions s() and t() are relevant to trigger the issue. Please do not try to see any sense in the code in the functions. They do not make sense anymore after I stripped them down as much as possible to simplify debugging for you.

The sourcecode I had to put here since the forum does not offer attaching files:

https://gist.github.com/jsteube/2e89e43bda98db61291bd07ed143cd55

If you want to see this issue in a real-world application:

$ git clone https://github.com/hashcat/hashcat
$ cd hashcat/
$ git checkout dbfd8d949e8afb5b5d4879636b7fac90bc598140
$ make -s
$ ./hashcat -b -m 5300 --mach
... 884,906,820 H/s ...
The kernel which is used was 'm05300_s04' in 'OpenCL/m05300_a3-optimized.cl'. Open the source and remove all code in kernel 'm05300_m04', 'm05300_m08', 'm05300_m16', 'm05300_s08' and 'm05300_s16'.
$ rm -rf kernels/
$ ./hashcat -b -m 5300 --mach
... 1,039,393,182 H/s ...

Robert_Crovella · February 13, 2020, 9:32pm

I suggest filing a bug. The instructions are linked to a sticky post at the top of this forum.

atomu · February 14, 2020, 10:01am

Done: https://developer.nvidia.com/nvidia_bug/2851145

Topic		Replies	Views
BUG: Broken register allocation, toolkit 2.3 CUDA Programming and Performance	15	7050	May 10, 2010
A Performance regression on CUDA 7.0 final CUDA Programming and Performance	2	1068	March 29, 2015
nvcc -cuda CUDA Programming and Performance	5	16045	November 18, 2008
CUDA 7.5 on Maxwell 980Ti drops performance by 10x versus CUDA 7.0, and 6.5 CUDA Programming and Performance	46	7569	October 11, 2016
Performance degradation in 7.0. Silly handling of constant memory in SASS vs 6.5 CUDA Programming and Performance	21	3865	April 2, 2015
Cuda 7.5 give a 30% performance loss vs cuda 6.5 CUDA Programming and Performance	33	13952	May 11, 2016
NVRM: Xid (0084:00) kernel does not terminate CUDA Programming and Performance	16	8877	April 10, 2008
Is this a bug of NVCC 5.5 on code generation/optimization? CUDA Programming and Performance	4	892	April 25, 2014
CUDA and Murphy's Law Some things you may bump into... CUDA Programming and Performance	16	19450	August 21, 2007
Possible nvcc compiler error CUDA Programming and Performance	0	1403	September 8, 2009

NVCC compiler issue, with POC

Related topics