Thread Block Scheduler uses disjoint SMs for 2 kernels in separate streams

Hi all,

I’m experiencing some unexpected behavior of the thread block scheduler when colocating two kernels, A and B, using CUDA streams. My initial idea was to measure the interference that kernel B creates within the SM when colocated with kernel A on the GPU. However, I noticed that both kernels end up being scheduled on separate sets of SMs even though they should be able to run on the same SM. I’m running on an H100 GPU with 132 SMs.
I attached the entire code to reproduce the behavior below. Unfortunately, I was not able to reproduce the behavior with a smaller example (apologies).

Here’s what I’m doing:
For kernel A, I’m running the SGEMM implementation from here: SGEMM_CUDA/src/kernels/9_kernel_autotuned.cuh at master · siboehm/SGEMM_CUDA · GitHub. I added a small snippet to the kernel (see code below) that saves the SM ID each thread block runs on. I run the kernel with matrix dimensions m=n=k=1024. The launch configuration is (8,8,1)x(256,1,1).

For kernel B, I’m using a custom kernel that just keeps executing fp64 multiplications. Once again, each thread block saves the SM ID it runs on. I launch this kernel with (132,1,1)x(128,1,1).

When profiling both kernels with NCU, their occupancy section looks as follows:

Based on the occupancy and the launch configurations, I would expect both kernels to be able to execute concurrently on the same SM, given that none of the per-SM limits (max blocks, max registers, max shared memory, max warps) would be violated.
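(As a sanity check of the per-kernel limits, the occupancy API can also be queried programmatically. The snippet below is only a sketch and not part of my program; it assumes the kernel definitions from the code further below, and it models each kernel in isolation rather than cross-kernel co-residency.)

// Sketch: how many blocks of each kernel the occupancy calculator says fit on one SM
// for the launch configurations used here (256 resp. 128 threads, no dynamic SMEM).
int sgemm_blocks_per_sm = 0, fp64_blocks_per_sm = 0;
cudaCheck(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &sgemm_blocks_per_sm, sgemmAutotuned<128, 128, 16, 8, 8>, 256, 0));
cudaCheck(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &fp64_blocks_per_sm, fp64_kernel, 128, 0));
printf("sgemm blocks/SM: %d, fp64 blocks/SM: %d\n", sgemm_blocks_per_sm, fp64_blocks_per_sm);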

But when running the program, the behavior is not as expected. I first execute each kernel once to load it into memory and avoid any potential lazy loading. I then launch the sgemm kernel 150 times in its own stream and the fp64 kernel 4 times in a separate stream, each time with 5,000,000 iterations, so that both kernels have the opportunity to execute at the same time.

However, the thread block scheduler does not seem to allow both kernels to run on the same SMs. The sgemm kernel doesn’t start until the very end of the FP64 kernel.


Given that the FP64 kernel is launched with 132 thread blocks, it runs one thread block per SM (I validated this by saving the SM IDs of each thread block). From the screenshot below, it looks like once the first thread blocks of the sgemm kernel complete and release their SM, the first thread blocks of the FP64 kernel can be scheduled on them.

By saving the SM IDs of each thread block from both kernels, I am able to see that the FP64 kernel uses:

  • 132 SMs for run 1
  • 94 SMs for run 2

On the other hand, the sgemm kernel uses 38 SMs for runs 1 to 81. This strongly supports the case that both kernels are running on separate sets of SMs. When rerunning the application multiple times, the sum of SMs used by the FP64 kernel in run 2 and by the sgemm kernel is always 132, though the split differs slightly.

Q1: How come the split can vary? My understanding so far was that the thread block scheduler would first schedule all blocks from one kernel before scheduling blocks from the next kernel. Here it looks like it interleaves the scheduling of both kernels’ blocks.

Q2: What kind of factors would make the thread block scheduler prefer to run kernels on separate sets of SMs rather than on the same SM? Is it trying to avoid interference within the SMs, for example? I am not running an MPS server here and consequently do not set the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable either.

However, I observed the same behavior when running both kernels in separate processes with MPS enabled. I did not set the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable for any of the processes.

Below is the entire code.

#include <iostream>
#include <cmath>
#include <cstdio>
#include <sys/time.h>
#include <unordered_set>
#include <cuda_runtime.h>

#define SGEMM_THREAD_BLOCKS 64
#define FP64_THREAD_BLOCKS 132
#define FP64_NUM_THREADS 128
#define SGEMM_INPUT_DIM 1024
#define SGEMM_RUNS 150
#define FP64_RUNS 4
#define FP64_NUM_ITERS 5000000

#define CEIL_DIV(M, N) (((M) + (N)-1) / (N))

const int K9_NUM_THREADS = 256;

template <const int BM, const int BN, const int BK, const int TM, const int TN>
__global__ void __launch_bounds__(K9_NUM_THREADS)
    sgemmAutotuned(int M, int N, int K, float alpha, float *A, float *B,
                   float beta, float *C, int* sm_id) {
    // copied from https://github.com/siboehm/SGEMM_CUDA/blob/master/src/kernels/9_kernel_autotuned.cuh
    const uint cRow = blockIdx.y;
    const uint cCol = blockIdx.x;

    // size of warptile
    constexpr int WM = TM * 16;
    constexpr int WN = TN * 16;
    // iterations of warptile
    constexpr int WMITER = CEIL_DIV(BM, WM);
    constexpr int WNITER = CEIL_DIV(BN, WN);

    // Placement of the thread in the warptile
    const int threadCol = threadIdx.x % (WN / TN);
    const int threadRow = threadIdx.x / (WN / TN);

    // allocate space for the current blocktile in smem
    __shared__ float As[BM * BK];
    __shared__ float Bs[BK * BN];

    // Move blocktile to beginning of A's row and B's column
    A += cRow * BM * K;
    B += cCol * BN;
    C += cRow * BM * N + cCol * BN;

    // calculating the indices that this thread will load into SMEM
    // we'll load 128bit / 32bit = 4 elements per thread at each step
    const uint innerRowA = threadIdx.x / (BK / 4);
    const uint innerColA = threadIdx.x % (BK / 4);
    constexpr uint rowStrideA = (K9_NUM_THREADS * 4) / BK;
    const uint innerRowB = threadIdx.x / (BN / 4);
    const uint innerColB = threadIdx.x % (BN / 4);
    constexpr uint rowStrideB = K9_NUM_THREADS / (BN / 4);

    // allocate thread-local cache for results in registerfile
    float threadResults[WMITER * WNITER * TM * TN] = {0.0};
    float regM[TM] = {0.0};
    float regN[TN] = {0.0};

    // outer-most loop over block tiles
    for (uint bkIdx = 0; bkIdx < K; bkIdx += BK) {
        // populate the SMEM caches
        for (uint offset = 0; offset + rowStrideA <= BM; offset += rowStrideA) {
        float4 tmp = reinterpret_cast<float4 *>(
            &A[(innerRowA + offset) * K + innerColA * 4])[0];
        // transpose A while storing it
        As[(innerColA * 4 + 0) * BM + innerRowA + offset] = tmp.x;
        As[(innerColA * 4 + 1) * BM + innerRowA + offset] = tmp.y;
        As[(innerColA * 4 + 2) * BM + innerRowA + offset] = tmp.z;
        As[(innerColA * 4 + 3) * BM + innerRowA + offset] = tmp.w;
        }

        for (uint offset = 0; offset + rowStrideB <= BK; offset += rowStrideB) {
        reinterpret_cast<float4 *>(
            &Bs[(innerRowB + offset) * BN + innerColB * 4])[0] =
            reinterpret_cast<float4 *>(
                &B[(innerRowB + offset) * N + innerColB * 4])[0];
        }
        __syncthreads();

        for (uint wmIdx = 0; wmIdx < WMITER; ++wmIdx) {
        for (uint wnIdx = 0; wnIdx < WNITER; ++wnIdx) {
            // calculate per-thread results
            for (uint dotIdx = 0; dotIdx < BK; ++dotIdx) {
            // block into registers
            for (uint i = 0; i < TM; ++i) {
                regM[i] = As[dotIdx * BM + (wmIdx * WM) + threadRow * TM + i];
            }
            for (uint i = 0; i < TN; ++i) {
                regN[i] = Bs[dotIdx * BN + (wnIdx * WN) + threadCol * TN + i];
            }
            for (uint resIdxM = 0; resIdxM < TM; ++resIdxM) {
                for (uint resIdxN = 0; resIdxN < TN; ++resIdxN) {
                threadResults[(wmIdx * TM + resIdxM) * (WNITER * TN) +
                                wnIdx * TN + resIdxN] +=
                    regM[resIdxM] * regN[resIdxN];
                }
            }
            }
        }
        }
        __syncthreads();
        // advance blocktile
        A += BK;     // move BK columns to right
        B += BK * N; // move BK rows down
    }

    // write out the results
    for (uint wmIdx = 0; wmIdx < WMITER; ++wmIdx) {
        for (uint wnIdx = 0; wnIdx < WNITER; ++wnIdx) {
        float *C_interim = C + (wmIdx * WM * N) + (wnIdx * WN);
        for (uint resIdxM = 0; resIdxM < TM; resIdxM += 1) {
            for (uint resIdxN = 0; resIdxN < TN; resIdxN += 4) {
            // load C vector into registers
            float4 tmp = reinterpret_cast<float4 *>(
                &C_interim[(threadRow * TM + resIdxM) * N + threadCol * TN +
                            resIdxN])[0];
            // perform GEMM update in reg
            const int i =
                (wmIdx * TM + resIdxM) * (WNITER * TN) + wnIdx * TN + resIdxN;
            tmp.x = alpha * threadResults[i + 0] + beta * tmp.x;
            tmp.y = alpha * threadResults[i + 1] + beta * tmp.y;
            tmp.z = alpha * threadResults[i + 2] + beta * tmp.z;
            tmp.w = alpha * threadResults[i + 3] + beta * tmp.w;
            // write back
            reinterpret_cast<float4 *>(&C_interim[(threadRow * TM + resIdxM) * N +
                                                    threadCol * TN + resIdxN])[0] =
                tmp;
            }
        }
        }
    }

    if (threadIdx.x == 0) {
        // record which SM this thread block ran on (one entry per block)
        int idx = blockIdx.x + blockIdx.y * gridDim.x;
        uint ret;
        asm volatile("mov.u32 %0, %%smid;" : "=r"(ret));
        sm_id[idx] = ret;
    }
}


// kernel B: four independent chains of dependent fp64 multiplications
__global__ void fp64_kernel(double *a, double *b, double *c, long long num_iters, int *sm_id){
    double op1 = a[threadIdx.x];
    double op2 = b[threadIdx.x];
    double op3 = 1.0;
    double op4 = 1.0;
    double op5 = 1.0;
    double op6 = 1.0;
    for (long long i = 0; i < num_iters; i++){
        op3 = __dmul_rn(op1, op3); 
        op4 = __dmul_rn(op2, op4);
        op5 = __dmul_rn(op1, op5);
        op6 = __dmul_rn(op2, op6);
    }
    c[threadIdx.x] = op3 + op4 + op5 + op6;
    if (threadIdx.x == 0){
        // record which SM this thread block ran on
        uint ret;
        asm volatile("mov.u32 %0, %%smid;" : "=r"(ret));
        sm_id[blockIdx.x] = ret;
    }
}

void cudaCheck(cudaError_t error) {
  if (error != cudaSuccess) {
    printf("[CUDA ERROR]:\n%s\n", cudaGetErrorString(error));
    exit(EXIT_FAILURE);
  }
};

void runSgemmAutotuned(int M, int N, int K, float alpha, float *A, float *B,
                       float beta, float *C, int *sm_id, cudaStream_t stream) {
  // copied from https://github.com/siboehm/SGEMM_CUDA/blob/master/src/kernels/9_kernel_autotuned.cuh
  const uint K9_BK = 16;
  const uint K9_TM = 8;
  const uint K9_TN = 8;
  const uint K9_BM = 128;
  const uint K9_BN = 128;
  dim3 blockDim(K9_NUM_THREADS);
  dim3 gridDim(CEIL_DIV(N, K9_BN), CEIL_DIV(M, K9_BM));
  sgemmAutotuned<K9_BM, K9_BN, K9_BK, K9_TM, K9_TN>
      <<<gridDim, blockDim, 0, stream>>>(M, N, K, alpha, A, B, beta, C, sm_id);
}


void runFp64Kernel(double *a, double *b, double *c, long long num_iters, int *sm_id, cudaStream_t stream){
    fp64_kernel<<<dim3(132), dim3(128), 0, stream>>>(a, b, c, num_iters, sm_id);
}


void randomize_matrix(float *mat, int N) {
  // NOTICE: Use gettimeofday instead of srand((unsigned)time(NULL)); the time
  // precision is too low and the same random number is generated.
  struct timeval time {};
  gettimeofday(&time, nullptr);
  srand(time.tv_usec);
  for (int i = 0; i < N; i++) {
    float tmp = (float)(rand() % 5) + 0.01 * (rand() % 5);
    tmp = (rand() % 2 == 0) ? tmp : tmp * (-1.);
    mat[i] = tmp;
  }
}


void allocateSgemmMemory(int input_dim, int num_runs, float **A, float **B, float **C, float **C_ref,
                         int **sm_id, float **dA, float **dB, float **dC, float **dC_ref, int **dsm_id) {
  // number of thread blocks
  size_t num_sm_ids = num_runs * SGEMM_THREAD_BLOCKS;

  *A = (float *)malloc(sizeof(float) * input_dim * input_dim);
  *B = (float *)malloc(sizeof(float) * input_dim * input_dim);
  *C = (float *)malloc(sizeof(float) * input_dim * input_dim);
  *C_ref = (float *)malloc(sizeof(float) * input_dim * input_dim);
  *sm_id = (int *)malloc(sizeof(int) * num_sm_ids);

  randomize_matrix(*A, input_dim * input_dim);
  randomize_matrix(*B, input_dim * input_dim);
  randomize_matrix(*C, input_dim * input_dim);

  // initialize sm_ids to -1
  for (size_t i = 0; i < num_sm_ids; i++){
    (*sm_id)[i] = -1;
  }

  cudaCheck(cudaMalloc((void **)dA, sizeof(float) * input_dim * input_dim));
  cudaCheck(cudaMalloc((void **)dB, sizeof(float) * input_dim * input_dim));
  cudaCheck(cudaMalloc((void **)dC, sizeof(float) * input_dim * input_dim));
  cudaCheck(cudaMalloc((void **)dC_ref, sizeof(float) * input_dim * input_dim));
  cudaCheck(cudaMalloc(dsm_id, sizeof(int) * num_sm_ids));

  cudaCheck(cudaMemcpy(*dA, *A, sizeof(float) * input_dim * input_dim,
                       cudaMemcpyHostToDevice));
  cudaCheck(cudaMemcpy(*dB, *B, sizeof(float) * input_dim * input_dim,
                       cudaMemcpyHostToDevice));
  cudaCheck(cudaMemcpy(*dC, *C, sizeof(float) * input_dim * input_dim,
                       cudaMemcpyHostToDevice));
  cudaCheck(cudaMemcpy(*dC_ref, *C, sizeof(float) * input_dim * input_dim,
                       cudaMemcpyHostToDevice));
  cudaCheck(cudaMemcpy(*dsm_id, *sm_id, sizeof(int) * num_sm_ids, cudaMemcpyHostToDevice));
}


void allocateFP64Memory(double **a, double **b, double **da, double **db, double **dc,
                        int **sm_id, int **dsm_id, int num_runs){
  size_t size_sm_id = num_runs * FP64_THREAD_BLOCKS;

  *a = (double *)malloc(sizeof(double) * FP64_NUM_THREADS);
  *b = (double *)malloc(sizeof(double) * FP64_NUM_THREADS);
  *sm_id = (int *)malloc(sizeof(int) * size_sm_id);

  cudaCheck(cudaMalloc(da, sizeof(double) * FP64_NUM_THREADS));
  cudaCheck(cudaMalloc(db, sizeof(double) * FP64_NUM_THREADS));
  cudaCheck(cudaMalloc(dc, sizeof(double) * FP64_NUM_THREADS));
  cudaCheck(cudaMalloc(dsm_id, sizeof(int) * size_sm_id));

  for (int i = 0; i < FP64_NUM_THREADS; i++) {
    (*a)[i] = 1.0f;
    (*b)[i] = 1.0f;
  }

  cudaCheck(cudaMemcpy(*da, *a, sizeof(double) * FP64_NUM_THREADS, cudaMemcpyHostToDevice));
  cudaCheck(cudaMemcpy(*db, *b, sizeof(double) * FP64_NUM_THREADS, cudaMemcpyHostToDevice));
}

void freeSgemmMemory(float **A, float **B, float **C, float **C_ref, int **sm_id,
                     float **dA, float **dB, float **dC, float **dC_ref, int **dsm_id) {
  free(*A);
  free(*B);
  free(*C);
  free(*C_ref);
  free(*sm_id);

  cudaCheck(cudaFree(*dA));
  cudaCheck(cudaFree(*dB));
  cudaCheck(cudaFree(*dC));
  cudaCheck(cudaFree(*dC_ref));
  cudaCheck(cudaFree(*dsm_id));
}


void freeFP64Memory(double **a, double **b, double **da, double **db, double **dc,
                    int **sm_id, int **dsm_id){
  free(*a);
  free(*b);
  free(*sm_id);

  cudaCheck(cudaFree(*da));
  cudaCheck(cudaFree(*db));
  cudaCheck(cudaFree(*dc));
  cudaCheck(cudaFree(*dsm_id));
}


int main(int argc, char **argv){
    if (argc != 2){
        std::cerr << "Usage: " << argv[0] << " <mode>" << std::endl;
        return 1;
    }

    int mode = atoi(argv[1]); // 0: colocate, 1: run separate for NCU profiling
    float alpha = 0.5, beta = 3.0;
    int m = SGEMM_INPUT_DIM, n = SGEMM_INPUT_DIM, k = SGEMM_INPUT_DIM;

    // allocate memory for sgemm
    float *A, *B, *C, *C_ref, *dA, *dB, *dC, *dC_ref;
    int *sm_id_sgemm, *dsm_id_sgemm;
    allocateSgemmMemory(SGEMM_INPUT_DIM, SGEMM_RUNS, &A, &B, &C, &C_ref, &sm_id_sgemm,
                        &dA, &dB, &dC, &dC_ref, &dsm_id_sgemm);

    // allocate memory for fp64
    double *a, *b, *da, *db, *dc;
    int *sm_id_fp64, *dsm_id_fp64;
    allocateFP64Memory(&a, &b, &da, &db, &dc, &sm_id_fp64, &dsm_id_fp64, FP64_RUNS);

    cudaStream_t stream_sgemm, stream_fp64;
    cudaCheck(cudaStreamCreateWithFlags(&stream_sgemm, cudaStreamNonBlocking));
    cudaCheck(cudaStreamCreateWithFlags(&stream_fp64, cudaStreamNonBlocking));

    // run each kernel once to load it into memory and for potential profiling
    cudaCheck(cudaDeviceSynchronize());
    runSgemmAutotuned(m, n, k, alpha, dA, dB, beta, dC, dsm_id_sgemm, stream_sgemm);
    runFp64Kernel(da, db, dc, FP64_NUM_ITERS, dsm_id_fp64, stream_fp64);
    cudaCheck(cudaDeviceSynchronize());

    // colocate both kernels in separate streams
    if (mode == 0){
        for (int i = 0; i < FP64_RUNS; i++){
            runFp64Kernel(da, db, dc, FP64_NUM_ITERS, dsm_id_fp64 + i * FP64_THREAD_BLOCKS, stream_fp64);
        }

        for (int i = 0; i < SGEMM_RUNS; i++){
            runSgemmAutotuned(m, n, k, alpha, dA, dB, beta, dC, dsm_id_sgemm + i * SGEMM_THREAD_BLOCKS, stream_sgemm);
        }

        cudaCheck(cudaDeviceSynchronize());

        // count number of SMs for each run of each kernel
        cudaCheck(cudaMemcpy(sm_id_fp64, dsm_id_fp64, sizeof(int) * FP64_RUNS * FP64_THREAD_BLOCKS, cudaMemcpyDeviceToHost));
        cudaCheck(cudaMemcpy(sm_id_sgemm, dsm_id_sgemm, sizeof(int) * SGEMM_RUNS * SGEMM_THREAD_BLOCKS, cudaMemcpyDeviceToHost));

        for (int i = 0; i < SGEMM_RUNS; i++){
            std::unordered_set<int> sm_ids_sgemm;
            for (int j = 0; j < SGEMM_THREAD_BLOCKS; j++){
                sm_ids_sgemm.insert(sm_id_sgemm[i * SGEMM_THREAD_BLOCKS + j]);
            }
            std::cout << "[SGEMM] Run " << i << ": " << sm_ids_sgemm.size() << " SMs" << std::endl;
        }

        for (int i = 0; i < FP64_RUNS; i++){
            std::unordered_set<int> sm_ids_fp64;
            for (int j = 0; j < FP64_THREAD_BLOCKS; j++){
                sm_ids_fp64.insert(sm_id_fp64[i * FP64_THREAD_BLOCKS + j]);
            }
            std::cout << "[FP64] Run " << i << ": " << sm_ids_fp64.size() << " SMs" << std::endl;
        }
    }
    
    std::cout << "Done" << std::endl;

    // free memory
    freeSgemmMemory(&A, &B, &C, &C_ref, &sm_id_sgemm, &dA, &dB, &dC, &dC_ref, &dsm_id_sgemm);
    freeFP64Memory(&a, &b, &da, &db, &dc, &sm_id_fp64, &dsm_id_fp64);
    cudaStreamDestroy(stream_sgemm);
    cudaStreamDestroy(stream_fp64);

    return 0;
}

I’m running on an H100 with 132 SMs, CUDA 12.5, driver 555.42.06, nsys 2024.2.3.38-242334140272v0, and ncu 2024.2.1.0.

I compile the code with
nvcc -arch=sm_90 -lineinfo -o main main.cu -lnvidia-ml

I collect the NCU report with
ncu -f -o main.ncu-rep --set full ./main 1

and collect the CUDA trace with
nsys profile --force-overwrite true -o main.nsys-rep ./main 0


FYI: I experienced the same behavior when I replaced the sgemm implementation with torch.mm, which relies on a closed-source cuBLAS implementation underneath.

Thanks a lot for any help in advance!

My first recommendation would be to add logic to timestamp the start and end of each thread block (CTA), or each warp, by reading %globaltimer (should have 32 ns resolution on GH100) together with the SMID. The global timer can be read via inline PTX or cuda::std::chrono (libcu++). If you only look at a single SM you can likely determine whether there is ever overlap. If there is overlap, is there a pattern AB but never BA?
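A minimal sketch of that kind of per-block instrumentation (the buffer names are hypothetical and sized to one entry per thread block; this is only an illustration, not the exact code being referred to):

__device__ __forceinline__ unsigned long long globaltimer_ns() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

__global__ void instrumented_kernel(/* original parameters, plus: */
                                    unsigned long long *block_start,
                                    unsigned long long *block_end,
                                    unsigned int *block_smid) {
    const unsigned int bid = blockIdx.x + blockIdx.y * gridDim.x;
    if (threadIdx.x == 0) {
        block_start[bid] = globaltimer_ns();
        asm volatile("mov.u32 %0, %%smid;" : "=r"(block_smid[bid]));
    }
    __syncthreads();

    // ... original kernel body ...

    __syncthreads();
    if (threadIdx.x == 0) {
        block_end[bid] = globaltimer_ns();
    }
}

Sorting the records by SMID and start time then shows directly whether blocks of the two kernels ever overlap on the same SM, and in which order.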

The issue may be that Kernel A uses shared memory (As, Bs) whereas Kernel B does not use shared memory. The driver will change the L1/SHMEM configuration based upon the needs of the kernel. When Kernel B runs by itself on an SM, the SM’s L1/SHMEM config is likely set to minimize SHMEM (only the driver allocation is needed). This keeps Kernel A from being able to run on that SM, as a configuration change requires the SM to be idle (all work has completed).

If Kernel A runs first, then Kernel B is likely able to run concurrently. If Kernel B gets on an SM with no Kernel A, then Kernel A cannot be co-resident if L1/SHMEM is too small.

time-->
Kernel A launches first
SM0     Thread Blocks   [Kernel A                  ]
                                     [Kernel B              ]
        L1/SHMEM        [PREFER SHARED                      ]
        
Kernel B launches first
SM0     Thread Blocks   [Kernel B              ]
                                     SSSSSSSSSSS[Kernel A                      ]
        L1/SHMEM        [PREFER L1             ][PREFER SHARED                 ]
        
        where SSSSS is thread block launch stalled due to L1/SHMEM configuration.

I do not know the current driver heuristic for changing L1/SHMEM; however, I think if you set Kernel B’s cache configuration preference to a percentage large enough for A + B you will improve overlap.

See the CUDA C++ Programming Guide on cudaFuncSetAttribute with cudaFuncAttributePreferredSharedMemoryCarveout.
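For the code above, that could look roughly like the following sketch (the percentage is illustrative; it needs to be large enough to cover the shared memory of all blocks you want co-resident):

// Sketch: raise the preferred L1/SHMEM carveout for the FP64 kernel so that an SM
// configured for it still leaves room for the SGEMM kernel's As/Bs tiles.
// Call once before the launches; the carveout is a hint, expressed as a percentage
// of the combined L1/shared-memory capacity.
cudaCheck(cudaFuncSetAttribute(fp64_kernel,
                               cudaFuncAttributePreferredSharedMemoryCarveout,
                               25));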

With CUDA 12.8, Nsight Compute 2025.1.0 has the metric

tpc__warp_launch_cycles_stalled_shader_cs_reason_shmem_allocation # of cycles where CS warp launch was stalled due to shmem allocation

If you were able to capture a range including both kernels (vs. the default serialized kernel profiling), this metric would have a high value.
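As a sketch, the collection could look something like the command below (exact flags depend on the NCU version; range replay assumes cu(da)ProfilerStart/Stop or NVTX markers around the colocation loop):

ncu --replay-mode range --metrics tpc__warp_launch_cycles_stalled_shader_cs_reason_shmem_allocation -o stall.ncu-rep ./main 0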

With CUDA 12.8, Nsight Systems 2024.6.2 has the GPU Metrics configuration GH100 Compute Triage with the metrics:

Shared Memory Available for Compute (TPC.TriageCompute.tpc__l1tex_mem_shared_data_compute_kilobytes_configured_realtime.sum * 1000)
Shared Memory Allocated for Compute (TPC.TriageCompute.tpc__l1tex_mem_shared_data_compute_lines_allocated_realtime.sum * 128)

These metrics do not look like they have the correct configuration, as they should be .avg.per_cycle_elapsed with multipliers half the listed values. However, I think the trends will still show the issue.

The experiment would be to run Kernel A and Kernel B serialized and observe the values, then run again while trying to force A then B, and B then A.

My expectation is that Kernel A by itself will have both values high and Kernel B will have both values low. When mixing A then B you will get concurrency: Available will stay high but Allocated will fall. When running B then A you will get no concurrency on each SM, and there will be a switch to the A configuration as the B thread blocks finish (each SM switching on a different cycle).
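A sketch of how that GPU Metrics collection might be invoked (flag names from recent nsys releases; the GH100 Compute Triage set has to be selected via --gpu-metrics-set using whatever alias or index your install lists for it):

nsys profile --gpu-metrics-device=0 --gpu-metrics-set=<GH100 Compute Triage set> -o metrics.nsys-rep ./main 0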

Hi @Greg,
Thanks a lot for your quick and detailed response. Your intuition was correct. Given the low shared memory usage of kernel B, I assume the driver indeed minimized SHMEM for it, which prevents kernel A from running on the same SM given its higher SHMEM requirements. Setting the SHMEM carveout to 15% for kernel B already allows both kernels to run on the same SM, while with 10% you still see them running on different SMs.
Do you know whether, under the hood, the driver can set this split for each SM individually? Namely, in the scenario where both kernels are running on the GPU at the same time but on different SMs, does each SM have an optimized configuration based on the kernel running on it, or is the same configuration applied to the entire GPU?
The NCU report for some reason shows both kernels as having the CachePreferNone Function Cache Configuration, but I assume this still refers to the legacy API from Kepler and may be misleading as to what is actually going on under the hood. Can you confirm this?

The new metric you mention is also very interesting; I haven’t seen any reference to it yet. It would indeed be great if NCU supported concurrent kernel profiling, at least for some of the metrics, as you suggested in your example.

Coming back quickly to my Q1: in the scenario where both kernels cannot run on the same SM due to SHMEM constraints, is my assumption wrong that the thread block scheduler will schedule kernels sequentially? Otherwise, how can we end up in a situation where the sgemm kernel (kernel A) runs 64 thread blocks on 38 SMs while the FP64 kernel (kernel B) runs 132 thread blocks on 94 SMs?

Thanks a lot!

The NCU report for some reason shows both kernels as having the CachePreferNone Function Cache Configuration, but I assume this still refers to the legacy API from Kepler and may be misleading as to what is actually going on under the hood. Can you confirm this?

I can confirm this maps to the older API. I have raised the issue with the NCU team. The driver allows this to be set as a hint, and there may not be a good method to convert from the hint to the user request or the actual execution.
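For context, a sketch of the legacy hint in question (this is the Kepler-era API that the NCU Function Cache Configuration field maps to; modern code would instead use the carveout attribute shown earlier):

// Legacy per-function cache preference hint
// (cudaFuncCachePreferNone / PreferL1 / PreferShared / PreferEqual).
cudaCheck(cudaFuncSetCacheConfig(fp64_kernel, cudaFuncCachePreferShared));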

Coming back quickly to my Q1: in the scenario where both kernels cannot run on the same SM due to SHMEM constraints, is my assumption wrong that the thread block scheduler will schedule kernels sequentially?

The heuristic of the CUDA/Compute Work Distributor is not documented. Through observation you can show that it schedules grid launches of the same priority sequentially. With multiple streams, the order in which kernels are launched on the GPU may not be easy to control.
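One knob that does exist for influencing which grid the work distributor prefers is stream priority; a minimal sketch (not part of the original program):

// Sketch: give the sgemm stream a higher priority than the fp64 stream so its grids
// are preferred by the work distributor. Priorities are hints, not ordering guarantees.
int least_prio, greatest_prio;
cudaCheck(cudaDeviceGetStreamPriorityRange(&least_prio, &greatest_prio));
cudaCheck(cudaStreamCreateWithPriority(&stream_sgemm, cudaStreamNonBlocking, greatest_prio));
cudaCheck(cudaStreamCreateWithPriority(&stream_fp64, cudaStreamNonBlocking, least_prio));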

Otherwise, how can we end up in a situation where the sgemm kernel (kernel A) runs 64 thread blocks on 38 SMs while the FP64 kernel (kernel B) runs 132 thread blocks on 94 SMs?

More SM level trace data would be required to understand this behavior.

If the kernels were fully serialized, then I agree that, from observed behavior, one thread block would be resident on each SM. However, once you run two grids concurrently, the block scheduler will start to distribute blocks for the next grid while the first grid is still running. In the case of A → B there is no occupancy limiter or SHMEM/L1 configuration limiter, so B will run on the same SMs; however, when you switch from B → A there is a SHMEM/L1 configuration limiter (until you made your change) that means A thread blocks can only launch on an SM once its B block completes. If the spread between the shortest and longest B block durations is large enough (100s to 1000s of cycles), it is likely that multiple A blocks will be launched on the same SM (the limit in your screenshot is 2, due to Block Limit Registers). This cascades into the next A → B transition, where as many as 16 B blocks could be launched on one SM.

The only true method to understand this behavior would be to capture smid, start timestamp, and end timestamp for every block and visualize the issue.