Originally published at: How to Improve CUDA Kernel Performance with Shared Memory Register Spilling | NVIDIA Technical Blog
When a CUDA kernel requires more hardware registers than are available, the compiler is forced to move the excess variables into local memory, a process known as register spilling. Register spilling affects performance because the kernel must access local memory—physically located in global memory—to read and write the spilled data. In CUDA Toolkit 13.0, NVIDIA…
Hi! I’m trying the `enable_smem_spilling` feature with the code example provided in the blog, and I have a few questions. Any suggestions are appreciated. Thanks in advance.
Environment:

```
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Nov__7_07:23:37_PM_PST_2025
Cuda compilation tools, release 13.1, V13.1.80
Build cuda_13.1.r13.1/compiler.36836380_0
```
I played with the code example provided in the blog:

1. Compiled with `nvcc -arch=sm_90 -Xptxas -v main.cu`; the output didn't show any spill stores or loads:

```
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'foo' for 'sm_90'
ptxas info    : Function properties for foo
    2496 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 20 registers, used 0 barriers, 2496 bytes cumulative stack size
ptxas info    : Compile time = 221.407 ms
```
2. I removed the `volatile` keyword and increased the `maxThreadsPerBlock` value in `__launch_bounds__`, then reproduced the spills:

```diff
@@ -1,17 +1,18 @@
 #include <cuda_runtime.h>
 #include <stdio.h>
-extern "C" __launch_bounds__(256)
+extern "C" __launch_bounds__(512)
 __global__ void foo(float *output_tensor, int num_elements) {
     int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
     if (thread_id >= num_elements) return;
-    volatile float input_feature[89], weight_scaled[89], bias_added[89], pre_activation[89];
-    volatile float activation_sin[89], activation_cos[89], output_accum[89];
+    float input_feature[89], weight_scaled[89], bias_added[89], pre_activation[89];
+    float activation_sin[89], activation_cos[89], output_accum[89];
 #pragma unroll
     for (int i = 0; i < 89; ++i) {
+        pre_activation[i] = (float)i * 0.1f;
         input_feature[i] = (float)thread_id + i;
         weight_scaled[i] = input_feature[i] * 2.0f;
         bias_added[i] = 5 + weight_scaled[i];
@@ -26,7 +27,7 @@
         float combined = amplified + activation_cos[i];
         output_accum[i] = combined;
     }
-    volatile float sum = 0.0f;
+    float sum = 0.0f;
 #pragma unroll
     for (int i = 0; i < 89; ++i) {
         sum += input_feature[i] + weight_scaled[i] + bias_added[i] + pre_activation[i]
```

Recompiled with `nvcc -arch=sm_90 main.fix.cu -Xptxas -v`:

```
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'foo' for 'sm_90'
ptxas info    : Function properties for foo
    1144 bytes stack frame, 1604 bytes spill stores, 1604 bytes spill loads
ptxas info    : Used 64 registers, used 0 barriers, 1144 bytes cumulative stack size
ptxas info    : Compile time = 211.375 ms
```
3. Added the PTX pragma `enable_smem_spilling` through inline assembly in the modified code example and recompiled with nvcc; then I got this:

```
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'foo' for 'sm_90'
ptxas info    : Function properties for foo
    1240 bytes stack frame, 1828 bytes spill stores, 1828 bytes spill loads
ptxas info    : Used 40 registers, used 0 barriers, 1240 bytes cumulative stack size, 47104 bytes smem
ptxas info    : Compile time = 217.580 ms
```
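For reference, the pragma was applied roughly like this, following the function-scope inline-assembly form described in the blog (a sketch; the rest of the kernel is the modified example from step 2):

```cuda
extern "C" __launch_bounds__(512)
__global__ void foo(float *output_tensor, int num_elements) {
    // Opt this kernel in to shared-memory register spilling via the
    // function-scope PTX pragma, injected with inline assembly.
    asm volatile (".pragma \"enable_smem_spilling\";");

    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (thread_id >= num_elements) return;
    // ... rest of the modified kernel from step 2, unchanged ...
}
```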
Here are my questions:

- Why is my output in step 1 different from the output in the blog post? I got 20 registers used and no spills. Is this related to the `volatile` keyword?
- For my modified code example, adding the PTX pragma `enable_smem_spilling` led to fewer registers used, but a larger stack frame and more spill stores/loads. IIUC, this is not expected. Did I do something wrong?
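In case it's useful for diagnosing, a small host-side check like the one below can cross-check the ptxas numbers at runtime (a minimal sketch of my own, not from the blog; it assumes the kernel `foo` above is defined earlier in the same file):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Assumes the __global__ kernel `foo` from above is defined in this
// translation unit.
int main() {
    cudaFuncAttributes attr;
    // Query the resource usage recorded for the compiled kernel; these
    // values should line up with the -Xptxas -v report.
    cudaError_t err = cudaFuncGetAttributes(&attr, (const void *)foo);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread   : %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    printf("static shared memory   : %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}
```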