CUDA block scheduling problems

I have discovered a feature of CUDA block scheduling where certain blocks can prevent other blocks in different streams from being scheduled to the same SM. This can significantly impact performance in certain situations. I’m wondering if this scheduling strategy is intentional.

I conducted an experiment on an RTX 3090 with CUDA 11.6.

I have two kernels, both of which run for a fixed period of time. The first kernel uses 24 registers; the second kernel uses 34 registers and 24 KB of shared memory.

__global__ void kernel_r24_s0KB(float seconds, clock_t clock_rate, int flag) {

    // Read the id of the SM this block is running on via the %smid special
    // register (the '%' must be escaped as '%%' in inline PTX).
    int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0)
        printf("kernel_r24_s0KB %d: block id = %d, sm id = %d\n", flag, blockIdx.x, smid);

    // Busy-wait for 'seconds' seconds; clock_rate is the SM clock in kHz.
    clock_t t0 = clock64();
    clock_t t1 = t0;
    while ((t1 - t0) / (clock_rate * 1000.0f) < seconds)
        t1 = clock64();
}
__global__ void kernel_r34_s24KB(float seconds, clock_t clock_rate, int flag) {

    // Read the id of the SM this block is running on.
    int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0)
        printf("kernel_r34_s24KB %d: block id = %d, sm id = %d\n", flag, blockIdx.x, smid);

    // Statically allocate and touch 24 KB of shared memory so the kernel
    // actually requires it.
    const int smem_size = 6144;

    __shared__ int smem[smem_size];
    int reg;
    for (int i = 0; i < smem_size; i++) reg = smem[i];
    for (int i = 0; i < smem_size; i++) smem[i] = reg;

    // Busy-wait for 'seconds' seconds; clock_rate is the SM clock in kHz.
    clock_t t0 = clock64();
    clock_t t1 = t0;
    while ((t1 - t0) / (clock_rate * 1000.0f) < seconds)
        t1 = clock64();
}

Then I created two streams and launched kernel1 with 41 blocks of one thread each in one stream, and kernel2 with 41 blocks of 256 threads each in the other stream. (The RTX 3090 has 82 SMs.)

clock_t clock_rate = 1900000;   // SM clock rate in kHz (~1.9 GHz on the RTX 3090)

cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);

kernel_r24_s0KB<<<41, 1, 0, stream0>>>(2.0f, clock_rate, 0);
kernel_r34_s24KB<<<41, 256, 0, stream1>>>(2.0f, clock_rate, 1);
cudaDeviceSynchronize();

The result was that the 41 blocks of kernel1 were scheduled to SM 0, SM 2, …, SM 80 as expected (one block per TPC). What was unexpected was that, even though the GPU still had enough resources to schedule kernel2 after the 41 blocks of kernel1 were resident, kernel1 prevented blocks from the other stream from being scheduled onto its SMs (in fact onto its whole TPC), so kernel2 had to wait until kernel1 finished and vacated the SMs. Nsight Systems (nsys) shows that kernel2 only started executing after kernel1 had cleared the SMs. If kernel1 is replaced with a kernel that does not exhibit this blocking behavior, kernel2 is scheduled and runs concurrently with kernel1.

I am very confused by this phenomenon. Launching just 41 single-thread blocks occupied the entire RTX 3090, and blocks from another, unrelated stream could not be scheduled for execution. This significantly affects GPU utilization and performance. I would like to know why the block scheduler is designed this way. I would be very grateful if you could answer my question.

Interesting.

I don’t have a 3090 to test on, but I am able to reproduce the observation on a V100 (80 SMs) and CUDA 11.4.

Curiously, if I reverse the invocation order:

kernel_r34_s24KB<<<41, 256, 0, stream1>>>(2.0f, clock_rate, 1);
kernel_r24_s0KB<<<41, 1, 0, stream0>>>(2.0f, clock_rate, 0);

I witness overlap. I can’t explain it at the moment.

Any difference if you get rid of the printf statements?

In chips prior to CC 9.0 (GH100), the L1/shared-memory configuration is per TPC (2 SMs). The configuration can only be changed when the TPC is idle (both SMs have no work). I think you can fix the issue in one of the following ways:

  1. In kernel_r24_s0KB, allocate shared memory. If you only allocate 1 B you may still end up in a case where the remaining shared memory is not sufficient for kernel_r34_s24KB to run, so I would allocate the same amount of shared memory in both kernels and simply not use it in the s0KB kernel.
  2. Use cudaFuncSetAttribute(kernel, cudaFuncAttributePreferredSharedMemoryCarveout, carveout), where the carveout is a percentage of the maximum shared memory (or one of the cudaSharedmemCarveout* enums), chosen large enough for both kernels. This is only a hint; I do not know whether the driver will still pick too small a shared-memory allocation if the first kernel uses 0 bytes. A minimal sketch follows this list.
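A sketch of option 2, reusing the launch code from the question; choosing cudaSharedmemCarveoutMaxShared as the value is my assumption of a carveout large enough for the 24 KB kernel, and an explicit percentage (e.g. 50) should also work:

// Hint the driver to keep a large shared-memory carveout even for the kernel
// that uses no shared memory, so the TPC does not have to be reconfigured
// before kernel_r34_s24KB's blocks can launch on the other SM of the TPC.
cudaFuncSetAttribute(kernel_r24_s0KB,
                     cudaFuncAttributePreferredSharedMemoryCarveout,
                     cudaSharedmemCarveoutMaxShared);   // or an explicit percentage, e.g. 50

kernel_r24_s0KB<<<41, 1, 0, stream0>>>(2.0f, clock_rate, 0);
kernel_r34_s24KB<<<41, 256, 0, stream1>>>(2.0f, clock_rate, 1);
cudaDeviceSynchronize();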

Since you launched in different streams, it is possible that kernel_r34_s24KB is scheduled first. I think you will find that if kernel_r34_s24KB is scheduled first you get concurrency; whereas with the order you show, the other SM in the TPC will be set to a shared-memory size smaller than 24 KB, which keeps kernel_r34_s24KB's thread blocks from having sufficient resources to launch until the SM in the TPC running kernel_r24_s0KB completes and the L1/shared-memory size can be adjusted.


It works! It works when I set cudaFuncAttributePreferredSharedMemoryCarveout. Thanks a lot!
