Why does my CUDA kernel use an unexpected stack frame?

I have a naive CUDA kernel that does a half-precision GEMM, shown below:

#include <cuda_runtime.h>
#include <cuda_fp16.h>


#define BM 128
#define BN 128
#define BK 32
#define TM 8
#define TN 8


__global__ void threadTilingKernel(const half* A, const half* B, half* C, int M, int N, int K) {
    int tid = threadIdx.y * blockDim.x + threadIdx.x;  // 0 ~ 255

    int load_a_smem_m = tid / BK;  // 0, 0, ...,  0, 1, 1, ...,  1, 2, 2, ...,  2, ... , 7, 7, ...,  7
    int load_a_smem_k = tid % BK;  // 0, 1, ..., 31, 0, 1, ..., 31, 0, 1, ..., 31, ... , 0, 1, ..., 31
    int load_b_smem_k = tid / BN;  // 0, 0, ...,   0, 1, 1, ...,   1
    int load_b_smem_n = tid % BN;  // 0, 1, ..., 127, 0, 1, ..., 127

    constexpr int NUM_THREAD     = BN / TN * BM / TM;  // 256
    constexpr int NUM_ROW_A_SMEM = NUM_THREAD / BK;    // 8
    constexpr int NUM_ROW_B_SMEM = NUM_THREAD / BN;    // 2

    constexpr int NUM_ITER_LOAD_A_SMEM = BM / NUM_ROW_A_SMEM;  // 16
    constexpr int NUM_ITER_LOAD_B_SMEM = BK / NUM_ROW_B_SMEM;  // 16

    __shared__ half a_smem[BM][BK];
    __shared__ half b_smem[BK][BN];

    // int c_reg_int[TM * TN /2] = {0};
    // half* c_reg = reinterpret_cast<half*>(c_reg_int[0]);
    
    half c_reg[TM * TN] = {0.0f};

    for (int bk = 0; bk < K; bk += BK) {
        for (int a_iter = 0; a_iter < NUM_ITER_LOAD_A_SMEM; ++a_iter) {
            a_smem[load_a_smem_m + a_iter * NUM_ROW_A_SMEM][load_a_smem_k] = A[(blockIdx.y * BM + load_a_smem_m + a_iter * NUM_ROW_A_SMEM) * K + bk + load_a_smem_k];
        }

        for (int b_iter = 0; b_iter < NUM_ITER_LOAD_B_SMEM; ++b_iter) {
            b_smem[load_b_smem_k + b_iter * NUM_ROW_B_SMEM][load_b_smem_n] = B[(bk + load_b_smem_k + b_iter * NUM_ROW_B_SMEM) * N + blockIdx.x * BN + load_b_smem_n];
        }
        __syncthreads();

        for (int tm = 0; tm < TM; ++tm) {
            for (int tn = 0; tn < TN; ++tn) {
                for (int k = 0; k < BK; ++k) {
                    c_reg[tm * TN + tn] += a_smem[threadIdx.y * TM + tm][k] * b_smem[k][threadIdx.x * TN + tn];
                }
            }
        }
        __syncthreads();
    }

    for (int tm = 0; tm < TM; ++tm) {
        for (int tn = 0; tn < TN; ++tn) {
            C[(blockIdx.y * BM + threadIdx.y * TM + tm) * N + blockIdx.x * BN + threadIdx.x * TN + tn] = c_reg[tm * TN + tn];
        }
    }
}



int main() {
    return 0;
}

I compiled the file for an NVIDIA A10 GPU with nvcc test.cu -Xptxas -warn-spills -res-usage -arch=sm_86 -O3 -o tmp

The result is

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z18threadTilingKernelPK6__halfS1_PS_iii' for 'sm_86'
ptxas info    : Function properties for _Z18threadTilingKernelPK6__halfS1_PS_iii
    128 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 48 registers, 16384 bytes smem, 388 bytes cmem[0]

AFAIK, each thread of a CUDA kernel can use at most 255 registers, and this kernel only used 48 registers.

I have two questions here:

  1. Why did this kernel use a 128-byte stack frame? Since it still has enough registers (255 - 48 = 207) to hold the data, I would expect no stack frame (in local memory) to be needed here.

  2. When I replace half c_reg[TM * TN] = {0.0f} with int c_reg_int[TM * TN / 2] = {0}; and cast it to a half pointer, the stack frame usage drops to 0. What is the difference between the two cases?

The 128 bytes of stack frame (local memory) come from c_reg: 64 array elements of 2 bytes each = 128 bytes. Memory is addressable, registers are not. As part of its optimizations, the CUDA compiler may (and “may” is the operative word here) scalarize small arrays and place them into registers, if

(1) Each and every access to the array resolves to a specific address at compile time.
(2) The compiler does not deem the array size overly large.

There may be additional requirements I am not aware of. Item (1) interacts with loop unrolling: by fully unrolling a loop nest, the compiler may discover that all accesses to an array are in fact to addresses fully determined at compile time. The size limit alluded to in item (2) is an implementation artifact and thus not documented. It could change at any time. One could run experiments to find out what it is currently.
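A minimal illustration of item (1), separate from your kernel (a sketch; the exact behavior depends on the toolkit version and its heuristics):

__global__ void constantIndex(float* out) {
    float acc[8] = {0.0f};
    #pragma unroll
    for (int i = 0; i < 8; ++i)              // fully unrolled: every index into acc is a compile-time
        acc[i] += out[threadIdx.x * 8 + i];  // constant, so the compiler can scalarize acc into registers
    #pragma unroll
    for (int i = 0; i < 8; ++i)
        out[threadIdx.x * 8 + i] = acc[i];
}

__global__ void runtimeIndex(float* out, int j) {
    float acc[8] = {0.0f};
    acc[j & 7] = 1.0f;                       // index not known at compile time: acc must be addressable,
    #pragma unroll                           // so it typically ends up in local memory (a stack frame)
    for (int i = 0; i < 8; ++i)
        out[threadIdx.x * 8 + i] = acc[i];
}

Compiling both with -Xptxas -v should report a stack frame only for the second one.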

My expectation would be that the compiler checks the size limit by looking at the bytes required, not the count of array elements. But: How does the compiler store a data object of type half in a register? For efficient execution, it may use one full 32-bit register for each scalarized 16-bit half element, turning 128 bytes of array storage into 256 bytes worth of register storage, which could then push past the size limit. I admit this hypothesis does not seem to jibe with your pointer casting trick, but it is worth examining with some dedicated experiments. Does your code work correctly when you use the pointer-casting trick?
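One simple experiment in that direction (a sketch, untested, with everything else in the kernel unchanged): keep the element count at 64 but vary only the element type, then compare the ptxas stack-frame lines. That would separate a limit counted in array bytes from one counted in elements or in register bytes:

half  c_reg[TM * TN] = {0.0f};     // 64 elements, 128 bytes of array storage (current version)
// float c_reg[TM * TN] = {0.0f};  // 64 elements, 256 bytes of array storage
// The float variant drops into the rest of the kernel unchanged: __half converts to and from
// float in device code, so the += accumulation and the final stores to C still compile.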

I am not sure filing an enhancement request against the compiler would be a useful course of action at this stage, as the larger picture of what is happening here is still unclear. It would require some more experiments and staring at generated SASS to develop an understanding of what one would want to request via an RFE.
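For anyone who wants to do that staring, one way (a sketch; tmp is the executable built by the compile command above, and LDL/STL are the SASS local-memory load/store instructions, so their presence shows whether the stack frame is actually touched):

cuobjdump -sass tmp | grep -E "LDL|STL"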

Hi @njuffa, thanks for your reply.
I made a mistake here: in the c_reg_int case, it should be

 half* c_reg = reinterpret_cast<half*>(&c_reg_int[0]);

instead of

 half* c_reg = reinterpret_cast<half*>(c_reg_int[0]);

After this fix, the stack frame usage becomes 128 bytes again.

I do think 128 bytes is small enough to be kept in registers.
I also have another kernel, which also computes an 8 * 8 thread tile, shown below:

#include <cuda_runtime.h>
#include <cuda_fp16.h>


#define BM 128
#define BN 128
#define BK 32
#define TM 8
#define TN 8


__global__ void threadTilingKernel(const half*  A, const half*  B, half*  C, int M, int N, int K) {
    int tid = threadIdx.y * blockDim.x + threadIdx.x;  // 0 ~ 255

    int load_a_smem_m = tid / BK;  // 0, 0, ...,  0, 1, 1, ...,  1, 2, 2, ...,  2, ... , 7, 7, ...,  7
    int load_a_smem_k = tid % BK;  // 0, 1, ..., 31, 0, 1, ..., 31, 0, 1, ..., 31, ... , 0, 1, ..., 31
    int load_b_smem_k = tid / BN;  // 0, 0, ...,   0, 1, 1, ...,   1
    int load_b_smem_n = tid % BN;  // 0, 1, ..., 127, 0, 1, ..., 127

    constexpr int NUM_THREAD     = BN / TN * BM / TM;  // 256
    constexpr int NUM_ROW_A_SMEM = NUM_THREAD / BK;    // 8
    constexpr int NUM_ROW_B_SMEM = NUM_THREAD / BN;    // 2

    constexpr int NUM_ITER_LOAD_A_SMEM = BM / NUM_ROW_A_SMEM;  // 16
    constexpr int NUM_ITER_LOAD_B_SMEM = BK / NUM_ROW_B_SMEM;  // 16

    __shared__ half a_smem[BM][BK];
    __shared__ half b_smem[BK][BN];

    // int c_reg_int[TM * TN /2] = {0};
    // half* c_reg = reinterpret_cast<half*>(&c_reg_int[0]);
    
    half c_reg[TM * TN] = {0.0f};
    

    for (int bk = 0; bk < K; bk += BK) {
        for (int a_iter = 0; a_iter < NUM_ITER_LOAD_A_SMEM; ++a_iter) {
            a_smem[load_a_smem_m + a_iter * NUM_ROW_A_SMEM][load_a_smem_k] = A[(blockIdx.y * BM + load_a_smem_m + a_iter * NUM_ROW_A_SMEM) * K + bk + load_a_smem_k];
        }

        for (int b_iter = 0; b_iter < NUM_ITER_LOAD_B_SMEM; ++b_iter) {
            b_smem[load_b_smem_k + b_iter * NUM_ROW_B_SMEM][load_b_smem_n] = B[(bk + load_b_smem_k + b_iter * NUM_ROW_B_SMEM) * N + blockIdx.x * BN + load_b_smem_n];
        }
        __syncthreads();

        for (int tm = 0; tm < TM; ++tm) {
            for (int tn = 0; tn < TN; ++tn) {
                for (int k = 0; k < BK; ++k) {
                    c_reg[tm * TN + tn] += a_smem[threadIdx.y * TM + tm][k] * b_smem[k][threadIdx.x * TN + tn];
                }
            }
        }
        __syncthreads();
    }

    for (int tm = 0; tm < TM; ++tm) {
        for (int tn = 0; tn < TN; ++tn) {
            C[(blockIdx.y * BM + threadIdx.y * TM + tm) * N + blockIdx.x * BN + threadIdx.x * TN + tn] = c_reg[tm * TN + tn];
        }
    }
}



__global__ void outerProductPackedNoConflictKernel(const half* A, const half* B, half* C, int M, int N, int K) {

    #define PACKED_LOAD(val) (reinterpret_cast<const packed_t*>(&(val))[0])
    #define PACKED_STORE(val) (reinterpret_cast<packed_t*>(&(val))[0])
    #define ROUND_DIV(a, b) (((a) + (b)-1) / (b))

    int tid = threadIdx.y * blockDim.x + threadIdx.x;

    constexpr int VEC_SIZE = 8;

    int load_a_smem_m = tid / (BK / VEC_SIZE);                   // 0, 0,  0,  0, 1, 1,  1,  1, ... , 63, 63, 63, 63
    int load_a_smem_k = (tid & (BK / VEC_SIZE - 1)) * VEC_SIZE;  // 0, 8, 16, 24, 0, 8, 16, 24, ... ,  0,  8, 16, 24
    int load_b_smem_k = tid / (BN / VEC_SIZE);                   // 0, 0, ...,   0, 1, 1, ...,   1, ... , 15, 15, ...,  15
    int load_b_smem_n = (tid & (BN / VEC_SIZE - 1)) * VEC_SIZE;  // 0, 8, ..., 120, 0, 8, ..., 120, ... ,  0,  8, ..., 120

    constexpr int NUM_THREAD     = BN / TN * BM / TM;             // 256
    constexpr int NUM_ROW_A_SMEM = NUM_THREAD / (BK / VEC_SIZE);  // 64
    constexpr int NUM_ROW_B_SMEM = NUM_THREAD / (BN / VEC_SIZE);  // 16

    constexpr int NUM_ITER_LOAD_A_SMEM = ROUND_DIV(BM, NUM_ROW_A_SMEM);  // 2
    constexpr int NUM_ITER_LOAD_B_SMEM = ROUND_DIV(BK, NUM_ROW_B_SMEM);  // 2
    static_assert(NUM_ITER_LOAD_A_SMEM > 0, "");
    static_assert(NUM_ITER_LOAD_B_SMEM > 0, "");

    __shared__ half a_smem[BK][BM + 8];
    __shared__ half b_smem[BK][BN];

    half c_reg[TM][TN] = {0};
    half a_reg[TM];
    half b_reg[TN];

    half a_tmp[VEC_SIZE];

    using packed_t = float4;

    for (int bk = 0; bk < K; bk += BK) {
        for (int a_iter = 0; a_iter < NUM_ITER_LOAD_A_SMEM; ++a_iter) {
            PACKED_STORE(a_tmp) = PACKED_LOAD(A[(blockIdx.y * BM + load_a_smem_m + a_iter * NUM_ROW_A_SMEM) * K + bk + load_a_smem_k]);
            for (int i = 0; i < VEC_SIZE; ++i) {
                a_smem[load_a_smem_k / VEC_SIZE + i * (BK / VEC_SIZE)][load_a_smem_m + a_iter * NUM_ROW_A_SMEM] = a_tmp[i];
            }
        }

        for (int b_iter = 0; b_iter < NUM_ITER_LOAD_B_SMEM; ++b_iter) {
            PACKED_STORE(b_smem[load_b_smem_k + b_iter * NUM_ROW_B_SMEM][load_b_smem_n]) = PACKED_LOAD(B[(bk + load_b_smem_k + b_iter * NUM_ROW_B_SMEM) * N + blockIdx.x * BN + load_b_smem_n]);
        }
        __syncthreads();

        for (int k = 0; k < BK; ++k) {
            for (int tm = 0; tm < TM; tm += VEC_SIZE) {
                PACKED_STORE(a_reg[tm]) = PACKED_LOAD(a_smem[k / VEC_SIZE + (k % VEC_SIZE) * (BK / VEC_SIZE)][threadIdx.y * TM + tm]);
            }

            for (int tn = 0; tn < TN; tn += VEC_SIZE) {
                PACKED_STORE(b_reg[tn]) = PACKED_LOAD(b_smem[k][threadIdx.x * TN + tn]);
            }

            for (int tm = 0; tm < TM; ++tm) {
                for (int tn = 0; tn < TN; ++tn) {
                    c_reg[tm][tn] += a_reg[tm] * b_reg[tn];
                }
            }
        }
        __syncthreads();
    }

    for (int tm = 0; tm < TM; ++tm) {
        for (int tn = 0; tn < TN; tn += VEC_SIZE) {
            PACKED_STORE(C[(blockIdx.y * BM + threadIdx.y * TM + tm) * N + blockIdx.x * BN + threadIdx.x * TN + tn]) = PACKED_LOAD(c_reg[tm][tn]);
        }
    }
}


int main() {
    return 0;
}

Here is my compile result:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z34outerProductPackedNoConflictKernelPK6__halfS1_PS_iii' for 'sm_86'
ptxas info    : Function properties for _Z34outerProductPackedNoConflictKernelPK6__halfS1_PS_iii
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 68 registers, 16896 bytes smem, 388 bytes cmem[0]
ptxas info    : Compiling entry function '_Z18threadTilingKernelPK6__halfS1_PS_iii' for 'sm_86'
ptxas info    : Function properties for _Z18threadTilingKernelPK6__halfS1_PS_iii
    128 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 48 registers, 16384 bytes smem, 388 bytes cmem[0]

As you can see, outerProductPackedNoConflictKernel uses 0 bytes of stack frame, which confuses me a lot.

Apparently the people in charge of configuring the relevant compiler heuristics disagree. Their goal is to tune heuristics such that they maximize the benefits to the greatest number of users. On average. This also means that there will always be situations where a particular heuristic (or combination of heuristics) is counterproductive.

A single example does not provide a good basis for filing an enhancement request. If one can discern a pattern over time of the compiler making sub-optimal decisions in particular situations, that might become actionable.

Unrolling the loops involving c_reg will use 254 registers without a stack frame.
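For example, on the first kernel (a sketch; the exact register count will depend on the toolkit version):

        #pragma unroll
        for (int tm = 0; tm < TM; ++tm) {
            #pragma unroll
            for (int tn = 0; tn < TN; ++tn) {
                for (int k = 0; k < BK; ++k) {
                    c_reg[tm * TN + tn] += a_smem[threadIdx.y * TM + tm][k] * b_smem[k][threadIdx.x * TN + tn];
                }
            }
        }

Doing the same on the two store loops at the end of the kernel makes every index into c_reg a compile-time constant, which should let ptxas keep the whole tile in registers.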

Thanks a lot~
I originally thought that using -O3 would automatically unroll every for-loop that can be unrolled. It seems that I misunderstood.

The CUDA compiler largely follows the philosophy of other compilers in the HPC space, in that most optimizations (including loop unrolling) are turned on by default. An example of an optimization you have to request explicitly, because the risk of numerical error is high, is -use_fast_math.

Informal comparisons I have made between CUDA 9 and CUDA 12 indicate that the latter consistently unrolls loops more aggressively. I have not encountered any performance regressions caused by this yet, but my general observation is that the more aggressive loop unrolling that CUDA 12 does on top of the copious unrolling CUDA 9 performed buys just 2% to 3% more performance.

This is hardly surprising, as the largest gains are achieved by unrolling the innermost one or two levels of deep loop nests.

Like other optimizations, loop unrolling is controlled – you guessed it – by heuristics. Increased register usage and a resulting potential reduction in occupancy is one aspect to consider, as may be code size: loop bodies exceeding the ICache size are a thing on GPUs, and that can cost a few percent in performance.

Deep loop nests are rarely unrolled completely, and programmers might need to use #pragma unroll to get more unrolling than the compiler deems wise. CUDA programmers should use the CUDA profiler to track the impact of such changes as they occur. I recall more than one thread in these forums where programmers insisted that they needed code generated in a certain way, and when they finally made that happen (by hook or by crook), they found that while they achieved their Y goal (getting code generated the way they thought would give their app a performance boost), they failed in the X goal (making their app run faster).
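A minimal way to close that loop (a sketch; it assumes Nsight Compute is installed and that the kernel is actually launched from main, which the minimal repro above does not do; the kernel-name filter is shown only for illustration):

nvcc test.cu -Xptxas -warn-spills -res-usage -arch=sm_86 -O3 -o tmp
ncu --kernel-name threadTilingKernel ./tmp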
