How to Improve CUDA Kernel Performance with Shared Memory Register Spilling

Originally published at: How to Improve CUDA Kernel Performance with Shared Memory Register Spilling | NVIDIA Technical Blog

When a CUDA kernel requires more hardware registers than are available, the compiler is forced to move the excess variables into local memory, a process known as register spilling. Register spilling affects performance because the kernel must access local memory—physically located in global memory—to read and write the spilled data. In CUDA Toolkit 13.0, NVIDIA…

Hi! I’m trying the enable_smem_spilling feature with the code example provided in the blog, and I have a few questions; any suggestions are appreciated. Thanks in advance.

Environment:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Nov__7_07:23:37_PM_PST_2025
Cuda compilation tools, release 13.1, V13.1.80
Build cuda_13.1.r13.1/compiler.36836380_0

Played with the code example provided in the blog:

  1. I compiled with nvcc -arch=sm_90 -Xptxas -v main.cu; the output showed no spill stores or loads:

    ptxas info    : 0 bytes gmem
    ptxas info    : Compiling entry function 'foo' for 'sm_90'
    ptxas info    : Function properties for foo
        2496 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info    : Used 20 registers, used 0 barriers, 2496 bytes cumulative stack size
    ptxas info    : Compile time = 221.407 ms
    
  2. I removed the volatile keyword and increased maxThreadsPerBlock in __launch_bounds__, which reproduced the spills:

    @@ -1,17 +1,18 @@
     #include <cuda_runtime.h>
     #include <stdio.h>
     
    -extern "C" __launch_bounds__(256)
    +extern "C" __launch_bounds__(512)
     __global__ void foo(float *output_tensor, int num_elements) {
     
         int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
         if (thread_id >= num_elements) return;
     
    -    volatile float input_feature[89], weight_scaled[89], bias_added[89], pre_activation[89];
    -    volatile float activation_sin[89], activation_cos[89], output_accum[89];
    +    float input_feature[89], weight_scaled[89], bias_added[89], pre_activation[89];
    +    float activation_sin[89], activation_cos[89], output_accum[89];
     
         #pragma unroll
         for (int i = 0; i < 89; ++i) {
    +        pre_activation[i] = (float)i * 0.1f;
             input_feature[i] = (float)thread_id + i;
             weight_scaled[i] = input_feature[i] * 2.0f;
             bias_added[i] = 5 + weight_scaled[i];
    @@ -26,7 +27,7 @@
             float combined = amplified + activation_cos[i];
             output_accum[i] = combined;
         }
    -    volatile float sum = 0.0f;
    +    float sum = 0.0f;
         #pragma unroll
         for (int i = 0; i < 89; ++i) {
             sum += input_feature[i] + weight_scaled[i] + bias_added[i] + pre_activation[i]
    
    nvcc -arch=sm_90 main.fix.cu -Xptxas -v
    ptxas info    : 0 bytes gmem
    ptxas info    : Compiling entry function 'foo' for 'sm_90'
    ptxas info    : Function properties for foo
        1144 bytes stack frame, 1604 bytes spill stores, 1604 bytes spill loads
    ptxas info    : Used 64 registers, used 0 barriers, 1144 bytes cumulative stack size
    ptxas info    : Compile time = 211.375 ms
    
  3. I added the PTX pragma enable_smem_spilling through inline assembly to the modified code example and recompiled with nvcc; this is what I got:

    ptxas info    : 0 bytes gmem
    ptxas info    : Compiling entry function 'foo' for 'sm_90'
    ptxas info    : Function properties for foo
        1240 bytes stack frame, 1828 bytes spill stores, 1828 bytes spill loads
    ptxas info    : Used 40 registers, used 0 barriers, 1240 bytes cumulative stack size, 47104 bytes smem
    ptxas info    : Compile time = 217.580 ms
    
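In case it matters, this is how I added the pragma in step 3, following the inline-assembly approach from the blog post; everything else in the kernel is unchanged from the diff above (sketch, trimmed to the relevant lines):

```cuda
#include <cuda_runtime.h>

extern "C" __launch_bounds__(512)
__global__ void foo(float *output_tensor, int num_elements) {
    // Opt this kernel in to shared-memory register spilling:
    // a per-function PTX pragma emitted via inline assembly,
    // placed as the first statement of the kernel body.
    asm volatile (".pragma \"enable_smem_spilling\";");

    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (thread_id >= num_elements) return;

    // ... rest of the kernel body as in the diff above ...
}
```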

Here are my questions:

  1. Why is my output in step 1 different from the output in the blog post? I got 20 registers used and no spills. Is this related to the volatile keyword?
  2. In my modified code example, adding the PTX pragma enable_smem_spilling led to fewer registers being used but a larger stack frame and more spill bytes. IIUC, this is not expected. Did I do something wrong?
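For reference on question 1, here is my own back-of-envelope check (my arithmetic, not from the blog): if the seven 89-element volatile float arrays plus the volatile sum scalar are all kept in the local-memory stack frame, the expected size matches the 2496 bytes ptxas reported in step 1, assuming no padding:

```python
# Back-of-envelope check (my own arithmetic, assuming no padding):
# seven volatile float[89] arrays plus one volatile float scalar,
# all forced into the local-memory stack frame.
num_arrays = 7      # input_feature, weight_scaled, bias_added, pre_activation,
                    # activation_sin, activation_cos, output_accum
elements = 89
sizeof_float = 4

stack_bytes = num_arrays * elements * sizeof_float + sizeof_float  # + volatile sum
print(stack_bytes)  # 2496, matching the "bytes stack frame" in step 1
```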