CUDA block scheduling problems

I have discovered a feature of CUDA block scheduling where certain blocks can prevent other blocks in different streams from being scheduled to the same SM. This can significantly impact performance in certain situations. I’m wondering if this scheduling strategy is intentional.

I conducted an experiment on an RTX 3090 with CUDA 11.6.

I have two kernels, both of which run for a fixed period of time. The first kernel uses 24 registers; the second kernel uses 34 registers and 24 KB of shared memory.

__global__ void kernel_r24_s0KB(float seconds, clock_t clock_rate, int flag) {

    // Read the id of the SM this block is running on via the %smid special
    // register (the '%' must be escaped as '%%' in inline PTX).
    int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0)
        printf("kernel_r24_s0KB %d: block id = %d, sm id = %d\n", flag, blockIdx.x, smid);

    // Busy-wait for 'seconds' seconds; clock_rate is the SM clock in kHz.
    clock_t t0 = clock64();
    clock_t t1 = t0;
    while ((t1 - t0) / (clock_rate * 1000.0f) < seconds)
        t1 = clock64();
}
__global__ void kernel_r34_s24KB(float seconds, clock_t clock_rate, int flag) {

    // Read the id of the SM this block is running on.
    int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0)
        printf("kernel_r34_s24KB %d: block id = %d, sm id = %d\n", flag, blockIdx.x, smid);

    // Statically allocate and touch 24 KB of shared memory so the kernel
    // actually requires it.
    const int smem_size = 6144;

    __shared__ int smem[smem_size];
    int reg;
    for (int i = 0; i < smem_size; i++) reg = smem[i];
    for (int i = 0; i < smem_size; i++) smem[i] = reg;

    // Busy-wait for 'seconds' seconds; clock_rate is the SM clock in kHz.
    clock_t t0 = clock64();
    clock_t t1 = t0;
    while ((t1 - t0) / (clock_rate * 1000.0f) < seconds)
        t1 = clock64();
}

Then I created two streams and launched kernel1 with 41 blocks of one thread each in one stream, and kernel2 with 41 blocks of 256 threads each in the other stream. (The RTX 3090 has 82 SMs.)

clock_t clock_rate = 1900000;   // SM clock rate in kHz (~1.9 GHz on the RTX 3090)

cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);

kernel_r24_s0KB<<<41, 1, 0, stream0>>>(2.0f, clock_rate, 0);
kernel_r34_s24KB<<<41, 256, 0, stream1>>>(2.0f, clock_rate, 1);
cudaDeviceSynchronize();

The result was that the 41 blocks of kernel1 were scheduled to SM 0, SM 2, …, SM 80 as expected (one block per TPC). What was unexpected was that, even though the GPU still had enough resources to schedule kernel2 after the 41 blocks of kernel1 were resident, kernel1 prevented blocks from the other stream from being scheduled onto its SMs (in fact onto its whole TPC), so kernel2 had to wait until kernel1 finished and vacated the SMs. Nsight Systems (nsys) shows that kernel2 only started executing after kernel1 had cleared the SMs. If kernel1 is replaced with a kernel that does not exhibit this blocking behavior, kernel2 is scheduled and runs concurrently with kernel1.

I am very confused by this phenomenon. Launching just 41 single-thread blocks occupied the entire RTX 3090, and blocks from another, unrelated stream could not be scheduled for execution. This significantly affects GPU utilization and performance. I would like to know why the block scheduler is designed this way. I would be very grateful if you could answer my question.

Interesting.

I don’t have a 3090 to test on, but I am able to reproduce the observation on a V100 (80 SMs) and CUDA 11.4.

Curiously, if I reverse the invocation order:

kernel_r34_s24KB<<<41, 256, 0, stream1>>>(2.0f, clock_rate, 1);
kernel_r24_s0KB<<<41, 1, 0, stream0>>>(2.0f, clock_rate, 0);

I witness overlap. I can’t explain it at the moment.

Any difference if you get rid of the printf statements?

In chips prior to CC 9.0 (GH100), the L1/shared-memory configuration is per TPC (2 SMs). The configuration can only be changed when the TPC is idle (both SMs have no work). I think you can fix the issue in one of the following ways:

  1. In kernel_r24_s0KB, allocate shared memory. If you only allocate 1 B you may still end up in a case where the remaining shared memory is not sufficient for kernel_r34_s24KB to run, so I would allocate the same amount of shared memory in both kernels and simply not use it in the s0KB kernel.
  2. Use cudaFuncSetAttribute(kernel, cudaFuncAttributePreferredSharedMemoryCarveout, carveout), where the carveout is a percentage of the maximum shared memory (or one of the cudaSharedmemCarveout* enums), chosen large enough for both kernels. This is only a hint; I do not know whether the driver will still pick too small a shared-memory allocation if the first kernel uses 0 bytes. A minimal sketch follows this list.
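A sketch of option 2, reusing the launch code from the question; choosing cudaSharedmemCarveoutMaxShared as the value is my assumption of a carveout large enough for the 24 KB kernel, and an explicit percentage (e.g. 50) should also work:

// Hint the driver to keep a large shared-memory carveout even for the kernel
// that uses no shared memory, so the TPC does not have to be reconfigured
// before kernel_r34_s24KB's blocks can launch on the other SM of the TPC.
cudaFuncSetAttribute(kernel_r24_s0KB,
                     cudaFuncAttributePreferredSharedMemoryCarveout,
                     cudaSharedmemCarveoutMaxShared);   // or an explicit percentage, e.g. 50

kernel_r24_s0KB<<<41, 1, 0, stream0>>>(2.0f, clock_rate, 0);
kernel_r34_s24KB<<<41, 256, 0, stream1>>>(2.0f, clock_rate, 1);
cudaDeviceSynchronize();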

Since you launched in different streams, it is possible that kernel_r34_s24KB is scheduled first. I think you will find that if kernel_r34_s24KB is scheduled first you get concurrency; whereas with the order you show, the other SM in the TPC will be set to a shared-memory size smaller than 24 KB, which keeps kernel_r34_s24KB's thread blocks from having sufficient resources to launch until the SM in the TPC running kernel_r24_s0KB completes and the L1/shared-memory size can be adjusted.


It works! It works when I set cudaFuncAttributePreferredSharedMemoryCarveout. Thanks a lot!
