Hi all,
I’m experiencing some unexpected behavior of the thread block scheduler when colocating two kernels A and B with CUDA streams. My initial idea was to measure the interference that kernel B creates within an SM when it is colocated with kernel A on the GPU. However, I noticed that both kernels end up being scheduled on separate sets of SMs even though they should be able to run on the same SMs. I’m running on an H100 GPU with 132 SMs.
I attached the entire code to reproduce the behavior below. Unfortunately, I was not able to replicate it with a smaller example (apologies).
Here’s what I’m doing:
For kernel A, I’m running the SGEMM implementation from here: SGEMM_CUDA/src/kernels/9_kernel_autotuned.cuh at master · siboehm/SGEMM_CUDA · GitHub. I added a small snippet to the kernel (see code below) that saves the SM ID each thread block is running on. I run the kernel with matrix dimensions m=n=k=1024. The launch configuration is (8,8,1)x(256,1,1).
For kernel B, I’m using a custom kernel that just keeps executing FP64 multiplications. Once again, each thread block saves the SM ID it is running on. I launch this kernel with (132,1,1)x(128,1,1).
When profiling both kernels with NCU, their occupancy section looks as follows:
- kernel A (NCU occupancy section screenshot)
- kernel B (NCU occupancy section screenshot)
Based on the occupancy numbers and the launch configurations, I would expect both kernels to be able to execute concurrently on the same SM: a block of the sgemm kernel needs 8 warps and 16 KB of static shared memory, a block of the FP64 kernel needs 4 warps and no shared memory, so running one block of each on an SM should not violate the per-SM limits on resident blocks, registers, shared memory, or warps.
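For what it's worth, the per-kernel headroom can also be queried through the occupancy API. The snippet below is only a sketch and not part of the attached program; it assumes it is dropped into main() of the code below (e.g. right after the warm-up launches), so that cudaCheck, both kernels, and the *_NUM_THREADS constants are in scope. It only reports how many blocks of each kernel could be resident on one SM in isolation, not the mixed case, but it should show that neither kernel fills an SM on its own:
// Sketch: ask the runtime how many blocks of each kernel fit on a single SM.
// <128, 128, 16, 8, 8> matches the K9_BM/K9_BN/K9_BK/K9_TM/K9_TN values used in
// runSgemmAutotuned; the last argument is 0 because both kernels only use static
// shared memory.
int maxBlocksSgemm = 0, maxBlocksFp64 = 0;
cudaCheck(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &maxBlocksSgemm, sgemmAutotuned<128, 128, 16, 8, 8>, K9_NUM_THREADS, 0));
cudaCheck(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &maxBlocksFp64, fp64_kernel, FP64_NUM_THREADS, 0));
printf("Max resident blocks per SM: sgemm %d, fp64 %d\n", maxBlocksSgemm, maxBlocksFp64);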
But when running the program, the behavior is not what I expected. I first execute each kernel once to load it into memory and avoid any potential lazy loading. I then launch the sgemm kernel 150 times in its own stream and the FP64 kernel 4 times in a separate stream, each FP64 run with 5,000,000 iterations, so that both kernels have the opportunity to execute at the same time.
However, the thread block scheduler does not seem to allow both kernels to run on the same SMs. The sgemm kernel doesn’t start until the very end of the FP64 kernel.
Given that the FP64 kernel is launched with 132 thread blocks, it runs one thread block per SM (I validated this by saving the SM IDs of each thread block). From the screenshot below, it looks like once the first thread blocks of the sgemm kernel complete and release their SMs, the first thread blocks of the FP64 kernel can be scheduled on them.
By saving the SM ID of each thread block in both kernels, I can see that the FP64 kernel uses:
- 132 SMs for run 1
- 94 SMs for run 2
On the other hand, the sgemm kernel uses 38 SMs for runs 1 to 81. This strongly suggests that both kernels are running on separate sets of SMs. When rerunning the application multiple times, the sum of the SMs used in run 2 by the sgemm kernel and by the FP64 kernel is always 132, though the split differs slightly.
Q1: How come the split can vary? My understanding so far was that the thread block scheduler schedules all blocks from one kernel before scheduling blocks from the next kernel, but here it looks like it interleaves the scheduling of both kernels’ blocks.
Q2: What factors would make the thread block scheduler prefer to run kernels on separate sets of SMs rather than on the same SMs? Is it trying to avoid interference within the SMs, for example? I am not running an MPS server here and consequently do not set the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable either.
However, I observed the same behavior when running both kernels in separate processes with MPS enabled. I did not set the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable for either process.
Below is the entire code.
#include <iostream>
#include <cmath>
#include <cstdio>
#include <sys/time.h>
#include <unordered_set>
#include <cuda_runtime.h>
#define SGEMM_THREAD_BLOCKS 64
#define FP64_THREAD_BLOCKS 132
#define FP64_NUM_THREADS 128
#define SGEMM_INPUT_DIM 1024
#define SGEMM_RUNS 150
#define FP64_RUNS 4
#define FP64_NUM_ITERS 5000000
#define CEIL_DIV(M, N) (((M) + (N)-1) / (N))
const int K9_NUM_THREADS = 256;
template <const int BM, const int BN, const int BK, const int TM, const int TN>
__global__ void __launch_bounds__(K9_NUM_THREADS)
sgemmAutotuned(int M, int N, int K, float alpha, float *A, float *B,
float beta, float *C, int* sm_id) {
// copied from https://github.com/siboehm/SGEMM_CUDA/blob/master/src/kernels/9_kernel_autotuned.cuh
const uint cRow = blockIdx.y;
const uint cCol = blockIdx.x;
// size of warptile
constexpr int WM = TM * 16;
constexpr int WN = TN * 16;
// iterations of warptile
constexpr int WMITER = CEIL_DIV(BM, WM);
constexpr int WNITER = CEIL_DIV(BN, WN);
// Placement of the thread in the warptile
const int threadCol = threadIdx.x % (WN / TN);
const int threadRow = threadIdx.x / (WN / TN);
// allocate space for the current blocktile in smem
__shared__ float As[BM * BK];
__shared__ float Bs[BK * BN];
// Move blocktile to beginning of A's row and B's column
A += cRow * BM * K;
B += cCol * BN;
C += cRow * BM * N + cCol * BN;
// calculating the indices that this thread will load into SMEM
// we'll load 128bit / 32bit = 4 elements per thread at each step
const uint innerRowA = threadIdx.x / (BK / 4);
const uint innerColA = threadIdx.x % (BK / 4);
constexpr uint rowStrideA = (K9_NUM_THREADS * 4) / BK;
const uint innerRowB = threadIdx.x / (BN / 4);
const uint innerColB = threadIdx.x % (BN / 4);
constexpr uint rowStrideB = K9_NUM_THREADS / (BN / 4);
// allocate thread-local cache for results in registerfile
float threadResults[WMITER * WNITER * TM * TN] = {0.0};
float regM[TM] = {0.0};
float regN[TN] = {0.0};
// outer-most loop over block tiles
for (uint bkIdx = 0; bkIdx < K; bkIdx += BK) {
// populate the SMEM caches
for (uint offset = 0; offset + rowStrideA <= BM; offset += rowStrideA) {
float4 tmp = reinterpret_cast<float4 *>(
&A[(innerRowA + offset) * K + innerColA * 4])[0];
// transpose A while storing it
As[(innerColA * 4 + 0) * BM + innerRowA + offset] = tmp.x;
As[(innerColA * 4 + 1) * BM + innerRowA + offset] = tmp.y;
As[(innerColA * 4 + 2) * BM + innerRowA + offset] = tmp.z;
As[(innerColA * 4 + 3) * BM + innerRowA + offset] = tmp.w;
}
for (uint offset = 0; offset + rowStrideB <= BK; offset += rowStrideB) {
reinterpret_cast<float4 *>(
&Bs[(innerRowB + offset) * BN + innerColB * 4])[0] =
reinterpret_cast<float4 *>(
&B[(innerRowB + offset) * N + innerColB * 4])[0];
}
__syncthreads();
for (uint wmIdx = 0; wmIdx < WMITER; ++wmIdx) {
for (uint wnIdx = 0; wnIdx < WNITER; ++wnIdx) {
// calculate per-thread results
for (uint dotIdx = 0; dotIdx < BK; ++dotIdx) {
// block into registers
for (uint i = 0; i < TM; ++i) {
regM[i] = As[dotIdx * BM + (wmIdx * WM) + threadRow * TM + i];
}
for (uint i = 0; i < TN; ++i) {
regN[i] = Bs[dotIdx * BN + (wnIdx * WN) + threadCol * TN + i];
}
for (uint resIdxM = 0; resIdxM < TM; ++resIdxM) {
for (uint resIdxN = 0; resIdxN < TN; ++resIdxN) {
threadResults[(wmIdx * TM + resIdxM) * (WNITER * TN) +
wnIdx * TN + resIdxN] +=
regM[resIdxM] * regN[resIdxN];
}
}
}
}
}
__syncthreads();
// advance blocktile
A += BK; // move BK columns to right
B += BK * N; // move BK rows down
}
// write out the results
for (uint wmIdx = 0; wmIdx < WMITER; ++wmIdx) {
for (uint wnIdx = 0; wnIdx < WNITER; ++wnIdx) {
float *C_interim = C + (wmIdx * WM * N) + (wnIdx * WN);
for (uint resIdxM = 0; resIdxM < TM; resIdxM += 1) {
for (uint resIdxN = 0; resIdxN < TN; resIdxN += 4) {
// load C vector into registers
float4 tmp = reinterpret_cast<float4 *>(
&C_interim[(threadRow * TM + resIdxM) * N + threadCol * TN +
resIdxN])[0];
// perform GEMM update in reg
const int i =
(wmIdx * TM + resIdxM) * (WNITER * TN) + wnIdx * TN + resIdxN;
tmp.x = alpha * threadResults[i + 0] + beta * tmp.x;
tmp.y = alpha * threadResults[i + 1] + beta * tmp.y;
tmp.z = alpha * threadResults[i + 2] + beta * tmp.z;
tmp.w = alpha * threadResults[i + 3] + beta * tmp.w;
// write back
reinterpret_cast<float4 *>(&C_interim[(threadRow * TM + resIdxM) * N +
threadCol * TN + resIdxN])[0] =
tmp;
}
}
}
}
if (threadIdx.x == 0) {
int idx = blockIdx.x + blockIdx.y * gridDim.x;
uint ret;
asm("mov.u32 %0, %smid;" : "=r"(ret) );
sm_id[idx] = ret;
}
}
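// Kernel B: four independent chains of dependent FP64 multiplications that keep each
// block busy for num_iters iterations; thread 0 of every block records the SM it ran on.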
__global__ void fp64_kernel(double *a, double *b, double *c, long long num_iters, int *sm_id){
double op1 = a[threadIdx.x];
double op2 = b[threadIdx.x];
double op3 = 1.0f;
double op4 = 1.0f;
double op5 = 1.0f;
double op6 = 1.0f;
for (long long i = 0; i < num_iters; i++){
op3 = __dmul_rn(op1, op3);
op4 = __dmul_rn(op2, op4);
op5 = __dmul_rn(op1, op5);
op6 = __dmul_rn(op2, op6);
}
c[threadIdx.x] = op3 + op4 + op5 + op6;
if (threadIdx.x == 0){
uint ret;
asm("mov.u32 %0, %smid;" : "=r"(ret) );
sm_id[blockIdx.x] = ret;
}
}
void cudaCheck(cudaError_t error) {
if (error != cudaSuccess) {
printf("[CUDA ERROR]:\n%s\n", cudaGetErrorString(error));
exit(EXIT_FAILURE);
}
}
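// Kernel A launch wrapper: with m = n = k = 1024 this yields the (8,8,1)x(256,1,1)
// launch configuration mentioned above.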
void runSgemmAutotuned(int M, int N, int K, float alpha, float *A, float *B,
float beta, float *C, int *sm_id, cudaStream_t stream) {
// copied from https://github.com/siboehm/SGEMM_CUDA/blob/master/src/kernels/9_kernel_autotuned.cuh
const uint K9_BK = 16;
const uint K9_TM = 8;
const uint K9_TN = 8;
const uint K9_BM = 128;
const uint K9_BN = 128;
dim3 blockDim(K9_NUM_THREADS);
dim3 gridDim(CEIL_DIV(N, K9_BN), CEIL_DIV(M, K9_BM));
sgemmAutotuned<K9_BM, K9_BN, K9_BK, K9_TM, K9_TN>
<<<gridDim, blockDim, 0, stream>>>(M, N, K, alpha, A, B, beta, C, sm_id);
}
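// Kernel B launch wrapper: 132 thread blocks of 128 threads, i.e. one block per SM
// on this H100.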
void runFp64Kernel(double *a, double *b, double *c, long long num_iters, int *sm_id, cudaStream_t stream){
fp64_kernel<<<dim3(132), dim3(128), 0, stream>>>(a, b, c, num_iters, sm_id);
}
void randomize_matrix(float *mat, int N) {
// NOTICE: Use gettimeofday instead of srand((unsigned)time(NULL)); the time
// precision is too low and the same random number is generated.
struct timeval time {};
gettimeofday(&time, nullptr);
srand(time.tv_usec);
for (int i = 0; i < N; i++) {
float tmp = (float)(rand() % 5) + 0.01 * (rand() % 5);
tmp = (rand() % 2 == 0) ? tmp : tmp * (-1.);
mat[i] = tmp;
}
}
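// Allocate and initialize host and device buffers for the sgemm kernel, including one
// SM-ID slot per thread block per run (initialized to -1).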
void allocateSgemmMemory(int input_dim, int num_runs, float **A, float **B, float **C, float **C_ref,
int **sm_id, float **dA, float **dB, float **dC, float **dC_ref, int **dsm_id) {
// number of thread blocks
size_t num_sm_ids = num_runs * SGEMM_THREAD_BLOCKS;
*A = (float *)malloc(sizeof(float) * input_dim * input_dim);
*B = (float *)malloc(sizeof(float) * input_dim * input_dim);
*C = (float *)malloc(sizeof(float) * input_dim * input_dim);
*C_ref = (float *)malloc(sizeof(float) * input_dim * input_dim);
*sm_id = (int *)malloc(sizeof(int) * num_sm_ids);
randomize_matrix(*A, input_dim * input_dim);
randomize_matrix(*B, input_dim * input_dim);
randomize_matrix(*C, input_dim * input_dim);
// initalize sm_ids to -1
for (int i = 0; i < num_sm_ids; i++){
(*sm_id)[i] = -1;
}
cudaCheck(cudaMalloc((void **)dA, sizeof(float) * input_dim * input_dim));
cudaCheck(cudaMalloc((void **)dB, sizeof(float) * input_dim * input_dim));
cudaCheck(cudaMalloc((void **)dC, sizeof(float) * input_dim * input_dim));
cudaCheck(cudaMalloc((void **)dC_ref, sizeof(float) * input_dim * input_dim));
cudaCheck(cudaMalloc(dsm_id, sizeof(int) * num_sm_ids));
cudaCheck(cudaMemcpy(*dA, *A, sizeof(float) * input_dim * input_dim,
cudaMemcpyHostToDevice));
cudaCheck(cudaMemcpy(*dB, *B, sizeof(float) * input_dim * input_dim,
cudaMemcpyHostToDevice));
cudaCheck(cudaMemcpy(*dC, *C, sizeof(float) * input_dim * input_dim,
cudaMemcpyHostToDevice));
cudaCheck(cudaMemcpy(*dC_ref, *C, sizeof(float) * input_dim * input_dim,
cudaMemcpyHostToDevice));
cudaCheck(cudaMemcpy(*dsm_id, *sm_id, sizeof(int) * num_sm_ids, cudaMemcpyHostToDevice));
}
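// Allocate and initialize host and device buffers for the FP64 kernel, including one
// SM-ID slot per thread block per run.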
void allocateFP64Memory(double **a, double **b, double **da, double **db, double **dc,
int **sm_id, int **dsm_id, int num_runs){
size_t size_sm_id = num_runs * FP64_THREAD_BLOCKS;
*a = (double *)malloc(sizeof(double) * FP64_NUM_THREADS);
*b = (double *)malloc(sizeof(double) * FP64_NUM_THREADS);
*sm_id = (int *)malloc(sizeof(int) * size_sm_id);
cudaCheck(cudaMalloc(da, sizeof(double) * FP64_NUM_THREADS));
cudaCheck(cudaMalloc(db, sizeof(double) * FP64_NUM_THREADS));
cudaCheck(cudaMalloc(dc, sizeof(double) * FP64_NUM_THREADS));
cudaCheck(cudaMalloc(dsm_id, sizeof(int) * size_sm_id));
for (int i = 0; i < FP64_NUM_THREADS; i++) {
(*a)[i] = 1.0f;
(*b)[i] = 1.0f;
}
cudaCheck(cudaMemcpy(*da, *a, sizeof(double) * FP64_NUM_THREADS, cudaMemcpyHostToDevice));
cudaCheck(cudaMemcpy(*db, *b, sizeof(double) * FP64_NUM_THREADS, cudaMemcpyHostToDevice));
}
void freeSgemmMemory(float **A, float **B, float **C, float **C_ref, int **sm_id,
float **dA, float **dB, float **dC, float **dC_ref, int **dsm_id) {
free(*A);
free(*B);
free(*C);
free(*C_ref);
free(*sm_id);
cudaCheck(cudaFree(*dA));
cudaCheck(cudaFree(*dB));
cudaCheck(cudaFree(*dC));
cudaCheck(cudaFree(*dC_ref));
cudaCheck(cudaFree(*dsm_id));
}
void freeFP64Memory(double **a, double **b, double **da, double **db, double **dc,
int **sm_id, int **dsm_id){
free(*a);
free(*b);
free(*sm_id);
cudaCheck(cudaFree(*da));
cudaCheck(cudaFree(*db));
cudaCheck(cudaFree(*dc));
cudaCheck(cudaFree(*dsm_id));
}
int main(int argc, char **argv){
if (argc != 2){
std::cerr << "Usage: " << argv[0] << " <mode>" << std::endl;
return 1;
}
int mode = atoi(argv[1]); // 0: colocate, 1: run separate for NCU profiling
float alpha = 0.5, beta = 3.0;
int m = SGEMM_INPUT_DIM, n = SGEMM_INPUT_DIM, k = SGEMM_INPUT_DIM;
// allocate memory for sgemm
float *A, *B, *C, *C_ref, *dA, *dB, *dC, *dC_ref;
int *sm_id_sgemm, *dsm_id_sgemm;
allocateSgemmMemory(SGEMM_INPUT_DIM, SGEMM_RUNS, &A, &B, &C, &C_ref, &sm_id_sgemm,
&dA, &dB, &dC, &dC_ref, &dsm_id_sgemm);
// allocate memory for fp64
double *a, *b, *da, *db, *dc;
int *sm_id_fp64, *dsm_id_fp64;
allocateFP64Memory(&a, &b, &da, &db, &dc, &sm_id_fp64, &dsm_id_fp64, FP64_RUNS);
cudaStream_t stream_sgemm, stream_fp64;
cudaCheck(cudaStreamCreateWithFlags(&stream_sgemm, cudaStreamNonBlocking));
cudaCheck(cudaStreamCreateWithFlags(&stream_fp64, cudaStreamNonBlocking));
// run each kernel once to load it into memory; in mode 1 these are the launches profiled with NCU
cudaCheck(cudaDeviceSynchronize());
runSgemmAutotuned(m, n, k, alpha, dA, dB, beta, dC, dsm_id_sgemm, stream_sgemm);
runFp64Kernel(da, db, dc, FP64_NUM_ITERS, dsm_id_fp64, stream_fp64);
cudaCheck(cudaDeviceSynchronize());
// colocate both kernels in separate streams
if (mode == 0){
for (int i = 0; i < FP64_RUNS; i++){
runFp64Kernel(da, db, dc, FP64_NUM_ITERS, dsm_id_fp64 + i * FP64_THREAD_BLOCKS, stream_fp64);
}
for (int i = 0; i < SGEMM_RUNS; i++){
runSgemmAutotuned(m, n, k, alpha, dA, dB, beta, dC, dsm_id_sgemm + i * SGEMM_THREAD_BLOCKS, stream_sgemm);
}
cudaCheck(cudaDeviceSynchronize());
// count number of SMs for each run of each kernel
cudaCheck(cudaMemcpy(sm_id_fp64, dsm_id_fp64, sizeof(int) * FP64_RUNS * FP64_THREAD_BLOCKS, cudaMemcpyDeviceToHost));
cudaCheck(cudaMemcpy(sm_id_sgemm, dsm_id_sgemm, sizeof(int) * SGEMM_RUNS * SGEMM_THREAD_BLOCKS, cudaMemcpyDeviceToHost));
for (int i = 0; i < SGEMM_RUNS; i++){
std::unordered_set<int> sm_ids_sgemm;
for (int j = 0; j < SGEMM_THREAD_BLOCKS; j++){
sm_ids_sgemm.insert(sm_id_sgemm[i * SGEMM_THREAD_BLOCKS + j]);
}
std::cout << "[SGEMM] Run " << i << ": " << sm_ids_sgemm.size() << " SMs" << std::endl;
}
for (int i = 0; i < FP64_RUNS; i++){
std::unordered_set<int> sm_ids_fp64;
for (int j = 0; j < FP64_THREAD_BLOCKS; j++){
sm_ids_fp64.insert(sm_id_fp64[i * FP64_THREAD_BLOCKS + j]);
}
std::cout << "[FP64] Run " << i << ": " << sm_ids_fp64.size() << " SMs" << std::endl;
}
}
std::cout << "Done" << std::endl;
// free memory
freeSgemmMemory(&A, &B, &C, &C_ref, &sm_id_sgemm, &dA, &dB, &dC, &dC_ref, &dsm_id_sgemm);
freeFP64Memory(&a, &b, &da, &db, &dc, &sm_id_fp64, &dsm_id_fp64);
cudaStreamDestroy(stream_sgemm);
cudaStreamDestroy(stream_fp64);
return 0;
}
I’m running on an H100 with 132 SMs, CUDA 12.5, driver 555.42.06, nsys 2024.2.3.38-242334140272v0, and ncu 2024.2.1.0.
I compile the code with
nvcc -arch=sm_90 -lineinfo -o main main.cu -lnvidia-ml
I collect the NCU report with
ncu -f -o main.ncu-rep --set full ./main 1
and collect the cuda trace with
nsys profile --force-overwrite true -o main.nsys-rep ./main 0
There are a few other posts that could be related, but none of them provide an answer to my case.
FYI: I experienced the same behavior when I replaced the sgemm implementation with torch.mm, which relies on a closed-source cuBLAS implementation underneath.
Thanks a lot for any help in advance!