Interference with different pipelines when IPC < 4

Hi all,

I’m running on an H100 GPU and I’m trying to verify/measure how the warp scheduler can become a bottleneck or source of contention when colocating two kernels on the same SM via CUDA streams with equal priority. I provide a concrete example below.
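For reference, the colocation setup boils down to two streams with the same (default, i.e. equal) priority and back-to-back launches, along these lines (device buffer allocation is omitted and the pointer names dA_*/dB_* are placeholders):

// two streams with equal (default) priority
cudaStream_t streamA, streamB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);

// one block per SM (132 on H100), one warp per scheduler (128 threads)
mul_fp64     <<<132, 128, 0, streamA>>>(dA_in1, dA_in2, dA_out);
fma_fp32_ilp2<<<132, 128, 0, streamB>>>(dB_in1, dB_in2, dB_out);
cudaDeviceSynchronize();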

I have two kernels, A and B, that target different pipelines: kernel A uses the FP64 pipeline and kernel B uses the FMA pipeline (verified in Nsight Compute). The workload of kernel A stays fixed, and I use kernel B to create interference at the warp-scheduler level by increasing its IPC. Below I show kernel B with the loop unrolled by a factor of 2, which gives an IPC of roughly 2; unrolling by a factor of 1, 3, or 4 gives an IPC of roughly 1, 3, or 3.75. I keep memory operations to a minimum and focus purely on the compute part by keeping all required data in registers.

// kernel A -> remains constant, targets the FP64 pipeline
__global__ void mul_fp64(double *a, double *b, double *c){
    double op1 = a[threadIdx.x];
    double op2 = b[threadIdx.x];
    double op3 = 1.0;
    double op4 = 1.0;
    // two independent multiply chains, operands stay in registers
    for (long long i = 0; i < NUM_ITR/2; i++){
        op3 = __dmul_rn(op1, op3);
        op4 = __dmul_rn(op2, op4);
    }
    c[threadIdx.x] = op3 + op4;
}

// kernel B -> use different levels of ILP to achieve different IPC
// (ILP-2 variant shown: two independent FMA chains per thread)
__global__ void fma_fp32_ilp2(float *a, float *b, float *c){
    float op1 = a[threadIdx.x];
    float op2 = b[threadIdx.x];
    float op3 = 0.0f;
    float op4 = 0.0f;
    for (long long i = 0; i < NUM_ITR; i++){
        op3 = __fmaf_rn(op1, op2, op3);
        op4 = __fmaf_rn(op1, op2, op4);
    }
    c[threadIdx.x] = op3 + op4;
}
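The higher-ILP variants simply add more independent accumulator chains; the ILP-4 version looks roughly like this (the loop trip count is illustrative, since each variant is normalized against its own isolated run):

// kernel B, ILP-4 variant: four independent FMA chains per thread
__global__ void fma_fp32_ilp4(float *a, float *b, float *c){
    float op1 = a[threadIdx.x];
    float op2 = b[threadIdx.x];
    float acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f, acc4 = 0.0f;
    for (long long i = 0; i < NUM_ITR; i++){
        acc1 = __fmaf_rn(op1, op2, acc1);
        acc2 = __fmaf_rn(op1, op2, acc2);
        acc3 = __fmaf_rn(op1, op2, acc3);
        acc4 = __fmaf_rn(op1, op2, acc4);
    }
    c[threadIdx.x] = (acc1 + acc2) + (acc3 + acc4);
}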

I measure 4 scenarios in which I change the ILP level of kernel B to 1, 2, 3, and 4 to achieve different levels of IPC (measured with Nsight Compute). In all 4 scenarios kernel A remains constant and issues roughly 1 instruction per cycle per SM. For each scenario I normalize the latencies of kernels A and B relative to running in isolation with the same workload (lower values indicate a larger slowdown relative to isolation). I launch both kernels with 132 thread blocks (1 per SM) and 128 threads per block (1 warp per scheduler). I observe the following results:

IPC kernel A | IPC kernel B | normalized latency A | normalized latency B
    1.03     |     1.01     |         0.87         |         0.86
    1.03     |     2.01     |         0.73         |         0.74
    1.03     |     2.95     |         0.61         |         0.82
    1.03     |     3.75     |         0.50         |         0.99
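(A note on the measurement: the per-kernel latencies behind this table can be captured with CUDA events recorded on each kernel's stream, along the lines of the sketch below, which is essentially what I do; the isolated baselines run the same code with only one kernel launched. Pointer names are again placeholders.)

// per-kernel latency in the colocated run: events on each kernel's stream
cudaEvent_t startA, stopA, startB, stopB;
cudaEventCreate(&startA); cudaEventCreate(&stopA);
cudaEventCreate(&startB); cudaEventCreate(&stopB);

cudaEventRecord(startA, streamA);
mul_fp64<<<132, 128, 0, streamA>>>(dA_in1, dA_in2, dA_out);
cudaEventRecord(stopA, streamA);

cudaEventRecord(startB, streamB);
fma_fp32_ilp2<<<132, 128, 0, streamB>>>(dB_in1, dB_in2, dB_out);
cudaEventRecord(stopB, streamB);

cudaDeviceSynchronize();
float msA = 0.f, msB = 0.f;
cudaEventElapsedTime(&msA, startA, stopA);   // colocated latency of kernel A
cudaEventElapsedTime(&msB, startB, stopB);   // colocated latency of kernel B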

Since the kernels use different pipelines (which I assume are independent), I would have expected to see significant slowdowns only once the combined IPC gets close to 4 (the H100 SM has 4 warp schedulers, each issuing at most 1 instruction per cycle, so the SM can issue at most 4 instructions per cycle). However, the first 2 scenarios show that both kernels already slow down significantly even though the combined IPC is well below 4. I’m trying to figure out what might be causing this:

  • Could this be a sign that the FMA and FP64 pipelines are not fully independent? I repeated the experiment with kernel A replaced by a kernel using the tensor core pipeline and observed similar behavior. From the post "Separate CUDA Core pipeline for FP16 and FP32?" I understand that at least they should not share the same dispatch port.
  • Could it be pollution of the L0 instruction cache from running both kernels concurrently? I have not found a way to measure/profile this. I would be surprised, however, if the L0 cache alone could cause such slowdowns, though I don’t know any numbers for the miss penalty.
  • Any other ideas on what might be causing this? (register bank contention?)

It’s interesting that in the first two scenarios both kernels seem to suffer equally, and only as the combined IPC approaches and exceeds 4 does the slowdown start to reflect that kernel B issues more instructions than kernel A and therefore suffers less. It’s a bit surprising, though, that towards the end kernel A takes almost all of the losses.

I appreciate your help as always!