High number of live registers


I’m currently working on implementing a multi-start local search algorithm in CUDA. Unfortunately, performance is significantly worse than expected. As far as I can tell, register spilling might be an important part of the problem. When I profile my code with Nsight Compute (Visual Studio edition), using the default debug build configuration of the CUDA Runtime API sample project, the source view shows that the number of “live registers” is consistently around 130 or higher.

So I created a small sample project to examine the issue further. My code consists of a simple kernel and a device function that gets called from the kernel.

__device__ __noinline__ int demoFunction();

__global__ void demoKernel()
{
	int sum = 0;
	for (int i = 0; i < 400; i++) {
		sum += demoFunction();
	}

	if ((blockIdx.x * blockDim.x + threadIdx.x) == 1) {
		printf("%d\n", sum);
	}
}

__device__ __noinline__ int demoFunction()
{
	int sum = 2;
	return sum;
}
To my surprise, Nsight Compute shows 128 live registers even for this very simple code. This is also the case when I remove “__noinline__”.

Is that behavior expected, or is there something wrong with my configuration? I launched 40 blocks with 256 threads each on my GTX 1050 (mobile) with CUDA Toolkit 10.2 for Windows.

Don’t do that. Really (in my opinion, and hopefully that goes without saying), you should never do any sort of performance analysis on a debug build in CUDA. Debug builds should be used for debugging and little else.

You are likely to see both a noticeable speed-up and a noticeable reduction in register usage by switching to a release build.

The -G switch used when creating a debug build disables most optimizations in CUDA device code, and this often has a dramatic effect on both performance and register usage.

Based on past observation, I would make the following stronger claim: the -G switch used during debug build creation disables all optimizations and may even add some pessimizations for the sake of improved observability.
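In practice the difference comes down to the flags passed to nvcc. Something like the following (file names are placeholders) shows both configurations; -Xptxas -v additionally makes ptxas print per-kernel register and spill counts at compile time:

```shell
# Debug build: -G disables device-code optimizations (use for debugging only)
nvcc -G -o demo_debug demo.cu

# Release build: optimizations enabled; -lineinfo keeps source correlation
# for the profiler without disabling optimization; -Xptxas -v reports
# per-kernel register usage and spill loads/stores at compile time
nvcc -O3 -lineinfo -Xptxas -v -o demo_release demo.cu
```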

Thanks for your fast answer. While I was aware that debug builds lack optimization, I didn’t know that they shouldn’t be used for profiling.

However, I have a follow-up question regarding register usage. My code includes the following nested loop.
While register usage is 24 for the outer loop, it rises to 71 inside the inner loop, which is close to the 72 registers allocated per thread. All values are of type “int”. Am I doing something wrong? From my understanding, there is no need for this many registers.

Furthermore, the profiler shows that the number of executed instructions for the “sum += …” line is about 450,000, while the number of predicated-on thread instructions is about 13,000,000. Is that expected behavior? For testing purposes, all threads are executed with exactly the same input data and should behave exactly the same. demoObject1 is a pointer to a class object located in global memory. demoObject2 is a thread-specific pointer to an object also located in global memory.

for (int k = 0; k < demoObject1->someValue1; k++) {
	for (int i = 0; i < demoObject1->someValue2; i++) {
		sum += demoObject1->someArray[i][demoObject2->someValue[i]][k];
	}
}
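As a side note on the loop nest itself: the reads of demoObject1->someValue1, demoObject1->someValue2, and demoObject2->someValue[i] go through global-memory pointers on every iteration, and hoisting the loop-invariant ones into locals can reduce both instruction count and memory traffic. A sketch, assuming int members as in the snippet above; the loops are interchanged so the someValue[i] load becomes invariant in the inner loop, which is safe here because the summation order does not affect the result:

```
// Hoist loop-invariant global loads out of the loop nest
// (member names/types mirror the snippet above and are assumptions).
const int n1 = demoObject1->someValue1;  // read once, not once per k
const int n2 = demoObject1->someValue2;  // read once, not once per (k, i)
for (int i = 0; i < n2; i++) {
	const int idx = demoObject2->someValue[i];  // invariant in k
	for (int k = 0; k < n1; k++) {
		sum += demoObject1->someArray[i][idx][k];
	}
}
```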

Completely independent of CUDA, the former statement (“lack optimization”) immediately and generally implies the latter (“shouldn’t be used for profiling”).

Generally speaking, people are interested in the performance characteristics of release builds. Generally speaking, release builds are compiled with optimization enabled. Therefore, generally speaking, profiling should only involve release builds.

Special cases, e.g. reverse engineering efforts, could be an exception to the general rule.