High number of live registers

CudaDev111 · November 26, 2019, 3:21pm

Hi,

I’m currently implementing a multi-start local search algorithm in CUDA. Unfortunately, performance is significantly worse than expected. As far as I can tell, register spilling might be an important part of the problem. When I profile my code with Nsight Compute (Visual Studio edition), using the default debug build configuration of the CUDA Runtime API sample project, the code view shows that the number of “live registers” is consistently about 130 (or higher).

So I created a small sample project to examine the issue further. My code consists of a simple kernel and a device function that gets called from the kernel.

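#include <cstdio>

// Forward declaration: the kernel calls demoFunction() before its definition.
__device__ __noinline__ int demoFunction();

__global__ void demoKernel()
{
	int sum = 0;
	for (int i = 0; i < 400; i++) {
		sum += demoFunction();
	}

	// Only one thread prints, to keep the output short.
	if ((blockIdx.x * blockDim.x + threadIdx.x) == 1) {
		printf("%d\n", sum);
	}
}

// Trivial device function; __noinline__ asks the compiler not to inline it.
__device__ __noinline__ int demoFunction()
{
	int sum = 2;
	return sum;
}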

To my surprise, Nsight Compute shows 128 live registers even for this very simple code. This is also the case when I remove “__noinline__”.

Is that behavior expected, or is there something wrong with my configuration? I launched 40 blocks with 256 threads each on my GTX 1050 (mobile), using CUDA Toolkit 10.2 for Windows.

Robert_Crovella · November 26, 2019, 3:31pm

Don’t do that. Really, (in my opinion, hopefully that goes without stating) you should never do any sort of performance analysis on a debug build in CUDA. Debug builds should be used for debugging and little else.

You are likely to see both a noticeable speedup and a noticeable reduction in registers used by switching to a release build.

The -G switch used during debug build creation disables most optimizations in CUDA device code, and this often has a dramatic effect on these things (performance, register usage).

njuffa · November 26, 2019, 8:51pm

Based on past observation, I would claim the following stronger statement applies: The -G switch used during debug build creation disables all optimizations and may even add some pessimizations for the sake of improved observability.
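For anyone who wants to see the effect of -G without a profiler, here is a minimal sketch that asks the CUDA runtime how many registers per thread were allocated for the kernel from the first post. The file name demo.cu and the helper registersPerThread() are assumptions made for illustration; cudaFuncGetAttributes and the numRegs / localSizeBytes fields are part of the runtime API. Building the same file once with -G and once without should make the difference visible, though the exact numbers will depend on GPU and toolkit version.

```cuda
// Hypothetical file name: demo.cu
// Debug build:    nvcc -G -o demo_debug demo.cu
// Release build:  nvcc -o demo_release demo.cu
// Adding -Xptxas -v also prints register usage at compile time.
#include <cstdio>
#include <cuda_runtime.h>

// Trivial device function, as in the original post.
__device__ __noinline__ int demoFunction()
{
    return 2;
}

__global__ void demoKernel()
{
    int sum = 0;
    for (int i = 0; i < 400; i++) {
        sum += demoFunction();
    }
    if ((blockIdx.x * blockDim.x + threadIdx.x) == 1) {
        printf("%d\n", sum);
    }
}

// Hypothetical helper: returns the registers per thread allocated for
// demoKernel, or -1 if the runtime query fails.
int registersPerThread()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, demoKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return -1;
    }
    // localSizeBytes > 0 hints at spilling or other local-memory use.
    printf("registers per thread: %d, local memory per thread: %zu bytes\n",
           attr.numRegs, attr.localSizeBytes);
    return attr.numRegs;
}

int main()
{
    return registersPerThread() < 0 ? 1 : 0;
}
```

Note that this reports the allocated register count for the whole kernel, not the per-line “live registers” figure shown in the profiler's code view.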

CudaDev111 · November 26, 2019, 9:21pm

Thanks for your fast answer. While I was aware that debug builds lack optimization I didn’t know that they shouldn’t be used for profiling.

However, I have a follow-up question regarding register usage. My code includes the following nested loop.
While register usage is 24 for the outer loop, it rises to 71, which is close to the 72 registers allocated per thread. All values are of type “int”. Am I doing something wrong? From my understanding, there is no need for this higher number of registers.

Furthermore, the profiler shows me that the number of executed instructions in the “sum += …” line is about 450,000, while the number of predicated-on thread instructions is 13,000,000. Is that expected behavior? For testing purposes, all threads are executed with exactly the same input data and should behave exactly the same. demoObject1 is a pointer to a class object located in global memory. demoObject2 is a thread-specific pointer to an object also located in global memory.

for (int k = 0; k < demoObject1->someValue1; k++) {
 for (int i = 0; i < demoObject1->someValue2; i++) {
  sum += demoObject1->someArray[i][demoObject2->someValue[i]][k];
 }
}
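A rough check on the two instruction counts, as a sketch only: it assumes the first figure is Nsight Compute's warp-level “Instructions Executed” counter and the second is the per-thread “Predicated-On Thread Instructions Executed” counter. Under that assumption one warp instruction accounts for up to 32 thread instructions, so the ratio of the two figures is the average number of active threads per warp instruction. The helper name threadsPerWarpInstruction is hypothetical, and the inputs are the approximate values quoted above.

```cuda
#include <cstdio>

// Average number of active, predicated-on threads per executed warp instruction.
double threadsPerWarpInstruction(double threadInstructions, double warpInstructions)
{
    return threadInstructions / warpInstructions;
}

int main()
{
    const double warpInstructions   = 450000.0;    // "about 450,000" from the post
    const double threadInstructions = 13000000.0;  // "13,000,000" from the post
    const int    warpSize           = 32;          // threads per warp on NVIDIA GPUs

    // Upper bound if every thread of every warp executed the line.
    printf("upper bound on thread instructions: %.0f\n",
           warpInstructions * warpSize);

    // Observed average of active threads per warp instruction.
    printf("average active threads per warp instruction: %.1f\n",
           threadsPerWarpInstruction(threadInstructions, warpInstructions));
    return 0;
}
```

With these rounded inputs the average comes out at roughly 29 of 32 threads, so a gap of this size between the two counters would not by itself indicate a problem.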

njuffa · November 26, 2019, 11:19pm

Completely independent of CUDA, the former statement, “lack optimization”, immediately and generally implies the latter statement “shouldn’t be used for profiling”.

Generally speaking, people are interested in the performance characteristics of release builds. Generally speaking, release builds are compiled with optimization enabled. Therefore, generally speaking, profiling should only involve release builds.

Special cases, e.g. reverse engineering efforts, could be an exception to the general rule.
