A lot of stalls even with 100% occupancy

Hello. I’m writing a 7-point 3D stencil kernel.

My current approach: each block of 32 x 16 x 1 threads computes a sub-matrix of size 30 x 14 x HEIGHT (x, y, z dimensions respectively). Each block computes level z, then moves on to level z + 1, and so on.

#define TILE_X 32
#define TILE_Y 16
#define IN_TILE_X 30
#define IN_TILE_Y 14
#define HEIGHT 32
#define PRELOAD 3
#define PRELOAD_COUNT (1 << PRELOAD)

__global__ void stencil(float *input, float *output, float *weight, int N) {
    int x = blockIdx.x * IN_TILE_X + threadIdx.x - 1;
    int y = blockIdx.y * IN_TILE_Y + threadIdx.y - 1;
    int z = blockIdx.z * HEIGHT;

    // Rolling z-window: prev/next hold this thread's z-1 and z+1 values,
    // the current z-slice is staged in shared memory, and nxt is a small
    // per-thread queue of input values prefetched a few z-levels ahead.
    float prev = 0, next = 0, nxt[PRELOAD_COUNT] = {0};
    __shared__ float current[TILE_Y][TILE_X];
    __shared__ float weights[7];

    int index = z * N * N + y * N + x;
    int N_squared = N * N;
    
    if (y >= 0 && z >= 0 && x >= 0 && y < N && x < N) {
        current[threadIdx.y][threadIdx.x] = input[index];

        if (z + 1 < N) {
            next = input[index + N_squared];
        }

        if (z - 1 >= 0) {
            prev = input[index - N_squared];
        }
    } else {
        current[threadIdx.y][threadIdx.x] = 0;
    }

    if (threadIdx.x < 7 && threadIdx.y == 0) {
        weights[threadIdx.x] = weight[threadIdx.x];
    }

    #pragma unroll 1
    for (int i = 0; (i < HEIGHT && z < N); i++) {
        // Every PRELOAD_COUNT iterations, refill the prefetch queue with the
        // next PRELOAD_COUNT input values, starting two z-levels ahead.
        if ((i & (PRELOAD_COUNT - 1)) == 0) {
            if (y < N && x < N && x >= 0 && y >= 0)
            for (int j = 0; j < PRELOAD_COUNT; j++) {
                if (z + j + 2 < N) {
                    nxt[j] = input[index + N_squared * (2 + j)];
                } else {
                    nxt[j] = 0;
                }
            }
        }
        __syncthreads();
        if (threadIdx.y > 0 && threadIdx.x > 0 && threadIdx.y < TILE_Y - 1 && threadIdx.x < TILE_X - 1 && y < N && x < N) {
            float x0 = fmaf(
                weights[4],
                current[threadIdx.y][threadIdx.x + 1],
                fmaf(weights[2], current[threadIdx.y + 1][threadIdx.x], fmaf(weights[0], current[threadIdx.y][threadIdx.x], 0))
            );
            
            float x1 = fmaf(
                weights[5],
                prev,
                fmaf(weights[3], current[threadIdx.y][threadIdx.x - 1], fmaf(weights[1], current[threadIdx.y - 1][threadIdx.x], 0))
            );
            
            output[index] = fmaf(weights[6], next, x0 + x1);
        }
        __syncthreads();
        // Slide the z-window up one level: current becomes prev, next becomes
        // current, and the following "next" comes from the prefetch queue.
        prev = current[threadIdx.y][threadIdx.x];
        current[threadIdx.y][threadIdx.x] = next;
        index += N_squared;
        z++;
        next = nxt[i & (PRELOAD_COUNT - 1)];
    }
}
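
For context, the grid and block dimensions implied by the indexing above are roughly the following (just a sketch with placeholder device pointers d_input/d_output/d_weight; the exact benchmark code is in the attachment further down):

dim3 block(TILE_X, TILE_Y, 1);                  // 32 x 16 x 1 threads per block
dim3 grid((N + IN_TILE_X - 1) / IN_TILE_X,      // 768 / 30 -> 26 blocks in x
          (N + IN_TILE_Y - 1) / IN_TILE_Y,      // 768 / 14 -> 55 blocks in y
          (N + HEIGHT - 1) / HEIGHT);           // 768 / HEIGHT blocks in z
stencil<<<grid, block>>>(d_input, d_output, d_weight, N);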

I’m benchmarking the performance of this kernel with a matrix of size 768 x 768 x 768. The best-performing value for HEIGHT is around 32. Values smaller than 32 perform worse due to reduced arithmetic intensity, which makes sense. But what I don’t understand is that once HEIGHT gets past 32, the performance consistently degrades. For example, average runtimes on my device (RTX 3060):

  • HEIGHT = 32: 12.8ms
  • HEIGHT = 128: 16.2ms
  • HEIGHT = 256: 17.8ms
  • HEIGHT = 768: 18.2ms

Profiling this kernel with ncu revealed that:

  • The kernel achieves 100% occupancy across all values of HEIGHT.
  • Higher HEIGHT also causes more stalls: ~31.5 cycles/inst at HEIGHT = 768 vs. ~20.5 cycles/inst at HEIGHT = 32. Also, at HEIGHT = 768 ncu warns that each warp spends around 10 cycles waiting for other warps at the barrier.
  • Warp Stall Sampling figures are roughly the same.

From my understanding, another ready warp should take the place of a stalling warp, and only a grid that is too small would hinder parallelism.

My question is: why does raising HEIGHT cause performance degradation, even with 100% occupancy and a relatively big grid? Thank you!!

Also here is the full code I used to benchmark the kernel.
stencil_test.txt (4.6 KB)

The shown increase in runtime is quite moderate (< 50%, not 3x or 10x). It could have other causes, such as reduced caching efficiency.
Could you have a look at the L1 and L2 cache hit rates in Nsight Compute?

When HEIGHT = 768:

    ---------------------------- ----------- ------------
    Metric Name                  Metric Unit Metric Value
    ---------------------------- ----------- ------------
    Memory Throughput                Gbyte/s       255.04
    Mem Busy                               %        43.23
    Max Bandwidth                          %        72.85
    L1/TEX Hit Rate                        %         0.02
    L2 Compression Success Rate            %            0
    L2 Compression Ratio                                0
    L2 Compression Input Sectors      sector            0
    L2 Hit Rate                            %        44.78
    Mem Pipes Busy                         %        49.61
    ---------------------------- ----------- ------------

When HEIGHT = 32:

    Section: Memory Workload Analysis
    ---------------------------- ----------- ------------
    Metric Name                  Metric Unit Metric Value
    ---------------------------- ----------- ------------
    Memory Throughput                Gbyte/s       322.75
    Mem Busy                               %        56.72
    Max Bandwidth                          %        92.19
    L1/TEX Hit Rate                        %         0.03
    L2 Compression Success Rate            %            0
    L2 Compression Ratio                                0
    L2 Compression Input Sectors      sector            0
    L2 Hit Rate                            %        56.19
    Mem Pipes Busy                         %        76.61
    ---------------------------- ----------- ------------

The behaviour of both kernels looks correct to me: each warp loads an unaligned 128-byte line, so it should issue two transactions, and may find one of them in L2 from a previous load. Since most reads are one-pass, and L1 is both small and shared by only a few blocks, a load can hardly hit data brought in by a previous warp, so the L1 hit rate is close to 0.
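
To make the unaligned-load part concrete, a worked example with the indexing above (assuming the usual alignment from cudaMalloc and N = 768, so every row starts on a 128-byte boundary):

    blockIdx.x = 1, threadIdx.x = 0   ->  x = 1 * 30 + 0 - 1 = 29
    starting byte offset in the row    =  29 * 4 = 116
    the warp's 32 consecutive floats   =  bytes 116 .. 243 of the row
    -> straddles the two 128-byte lines [0, 128) and [128, 256), hence two transactions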

But the first kernel did not utilize memory as well as the second. I believe the difference between the two kernels is a consequence of the stalls rather than the culprit?

I haven’t looked at your code in detail, but from a quick glance it seems that an increase in HEIGHT directly increases the stride of accesses through index, which could be responsible for the lower effective bandwidth.

Strided accesses slowing down with increasing stride due to interaction with physical properties of the memory subsystem (such as banking or DRAM-open-page policy) is a fairly common effect (orthogonal to use of GPUs), but one would have to work through the details of the memory organization for this particular GPU to confirm or refute such a hypothesis.

Stalls are not bad per se, as long as enough warps are active to compensate. And one would expect more stalls with more warps fighting for resources. The question is why the L2 hit rate goes down.

And it does not explain the full difference, I think.

Could you check the average number of transactions and sectors per memory instruction?

ncu is good for reporting; the graphical Nsight Compute is good for quickly checking several values.

Not sure if it helps in your case: you can round your block sizes in the x (preferred) and/or y direction to 16 or 32 and exit the superfluous threads early.

You could store your input differently or multiple times and use more direct indexing. So perhaps it would be better aligned. But I am not sure.

You could try to rearrange work (assignment of work to specific threads) to improve locality of memory accesses to get more from L1.

You could try to let the threads of a warp cooperate for data loading and exchange with shared memory.

Those are all general ideas from looking at your code; each one would not necessarily work, but could.

I believe the number of strided accesses is the same regardless of the value of HEIGHT? So I don’t think that is the case here.

The average number of sectors/req on L2 is around 2.5. Interestingly, a higher value of HEIGHT causes more L2 read misses, while L2 writes always hit. Piecing everything together, it seems that more L2 misses → more DRAM loads → more stalls.

The question still remains; I guess I have to do more experiments.

L2 writes always hit the cache, because writes are done as write-back: any write is just cached and possibly written to DRAM in the background afterwards (the GPU could also just keep the writes in L2, as long as L2 is not needed otherwise).

2.5 sectors/req. probably is not too bad in the case of your kernel.

It seems that you have some locality advantages for smaller values of HEIGHT and thus higher hit rate.

Perhaps also compare the overall data size to the L2 size.

Which GPU are you using? The Ada and Blackwell generation GPUs (and the earlier datacenter GPUs like the Ampere A100) have quite a large L2 cache, which could help a lot in your case. Apart from that, each GPU model has a different DRAM interface speed.

My GPU is an RTX 3060 12GB, which has 3 MB of L2, quite tiny compared to the ~1.81 GB size of a 768x768x768 float matrix. I guess it is not too bad, because it allows me to observe this interesting phenomenon.
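
For reference, the raw byte counts (4-byte floats):

    768 * 768 * 768 * 4 B ≈ 1.81 GB   (whole matrix)
    768 * 768 * 4 B       ≈ 2.36 MB   (a single z-slice)

So even a single z-slice is already a large fraction of the 3 MB L2.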

The condition for an L2 hit is a neighboring block at exactly the same z-level accessing a common sector (due to the unaligned load). I tried printing the block indices, and it seems that blocks with the same blockIdx.z are scheduled to launch consecutively.

My theory is that the z-levels currently being processed across blocks are “resynced” after a new wave of blocks is launched. The longer a block executes, the more its z-level diverges from that of its neighboring blocks. A smaller HEIGHT allows this “resync” to happen more frequently.

What do you think?

You can try that out:
Do not launch all blocks in the z direction at once, but different numbers, and compare the speed. Artificially create your waves. You can even launch the kernel in a loop several times from the host.

But relate the number of blocks to the number of SMs so that each is occupied equally.
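
A minimal sketch of such artificial waves, assuming a small, hypothetical change to the kernel signature (a zStart parameter, so that the kernel would compute z = blockIdx.z * HEIGHT + zStart):

// Hypothetical host loop: launch only a few z-chunks per kernel call so that
// blocks working on nearby z-levels execute at roughly the same time.
dim3 block(TILE_X, TILE_Y, 1);
int chunksZ = (N + HEIGHT - 1) / HEIGHT;   // total number of z-chunks
int waveZ = 2;                             // z-chunks per launch; tune against the number of SMs

for (int zc = 0; zc < chunksZ; zc += waveZ) {
    int thisWave = (chunksZ - zc < waveZ) ? (chunksZ - zc) : waveZ;
    dim3 grid((N + IN_TILE_X - 1) / IN_TILE_X,
              (N + IN_TILE_Y - 1) / IN_TILE_Y,
              thisWave);
    // zc * HEIGHT would be passed as the hypothetical zStart for this wave
    stencil<<<grid, block>>>(input, output, weight, N /*, zc * HEIGHT */);
}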

I modified the kernel a bit and launched it on 768x768xH matrices. The L2 hit rate for a 768x768xH matrix is roughly equivalent to that of a 768x768x768 matrix with HEIGHT = H.

Also, considering that the L2 hit rate drops as HEIGHT grows, I think I can put this mystery to rest for now. More work is needed to optimize the kernel further with this in mind.

Thank you and @njuffa for helping me!

That is fine! You solved some mysteries, learned a bit and can write other code or go back and optimize more.

Glad to help.