MPS vs no MPS: drastic increase in kernel latency

Hi all,

I’m seeing some interesting behavior when colocating two kernels on the GPU with and without MPS, and I’m having trouble explaining the difference.

I’d like to colocate a PyTorch softmax kernel and a custom compute kernel of mine that just does some arithmetic operations and is supposed to keep all its operands in registers. Below is an extract of the code. I call the compute kernel from Python through the ctypes interface (I omitted that code for readability, but I’m happy to share it if helpful).


import torch.multiprocessing as mp
import argparse
import json
import os

import torch
from vllm.interference.inter_funcs import kernel_name_dict


def run_softmax(args, barrier):
    logits = torch.rand([16, 128256]).cuda()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    if barrier:
        barrier.wait()
    for i in range(100):
        start_event.record()
        probs = logits.softmax(dim=-1, dtype=torch.float32)
        end_event.record()
        torch.cuda.synchronize()
        print(f"Step {i} took {start_event.elapsed_time(end_event)} ms")
    torch.cuda.synchronize()

def run_inter(args, barrier):
    # setting an MPS share depending on the scenario
    # os.environ['CUDA_MPS_ACTIVE_THREAD_PERCENTAGE'] = "50"
    init_args = {"device": torch.device('cuda'), "num_floats": 1024}
    run_args = {"num_tb": args.num_tb, "threads_per_block": 1024, "num_itrs": 500000000}

    # instantiate the kernel wrapper (interference_class is defined in the
    # ctypes glue code omitted above) and run it once to get lazy loading
    # out of the way
    inter_kernel = interference_class(**init_args)
    inter_kernel.run(**run_args)

    barrier.wait()

    inter_kernel.run(**run_args)
    print(f"Exiting....")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_tb", type=int, default=0, help="Number of thread blocks of the interfering kernel")

    user_args = parser.parse_args()
    mp.set_start_method('spawn')

    do_interference = False
    if user_args.num_tb > 0:
        do_interference = True

    barrier = mp.Barrier(2) if do_interference else None

    softmax = mp.Process(target=run_softmax, args=(user_args, barrier))
    softmax.start()

    if do_interference:
        interf = mp.Process(target=run_inter, args=(user_args, barrier))
        interf.start()

    softmax.join()
    if do_interference:
        interf.join()

// the compute kernel (CUDA): keeps its operands in registers and issues
// independent FMA chains, so it is purely compute-bound
__global__ void compute_kernel(float *a, float *b, float *c, long long num_itrs)
{
    // load the two operands once; everything below stays in registers
    float op1 = a[threadIdx.x];
    float op2 = b[threadIdx.x];
    float op3 = 1.0f;
    float op4 = 1.0f;
    float op5 = 1.0f;
    float op6 = 1.0f;

    // four independent accumulators keep the FMA pipelines busy
    for (long long i = 0; i < num_itrs; i++)
    {
        op3 = __fmaf_rn(op1, op2, op3);
        op4 = __fmaf_rn(op1, op2, op4);
        op5 = __fmaf_rn(op1, op2, op5);
        op6 = __fmaf_rn(op1, op2, op6);
    }

    // write the results back so the compiler cannot optimize the loop away
    c[threadIdx.x] = op3 + op4 + op5 + op6;
}
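
For context, here is a minimal sketch of what such a ctypes wrapper can look like; the shared-library name and the exported launcher symbol are placeholders for illustration, not the actual vllm.interference code:

import ctypes

import torch

# hypothetical launcher for the compute kernel above; "libcompute_kernel.so"
# and "launch_compute_kernel" are placeholder names
_lib = ctypes.CDLL("./libcompute_kernel.so")
_lib.launch_compute_kernel.argtypes = [
    ctypes.c_void_p,    # float *a
    ctypes.c_void_p,    # float *b
    ctypes.c_void_p,    # float *c
    ctypes.c_longlong,  # num_itrs
    ctypes.c_int,       # num_tb (grid size)
    ctypes.c_int,       # threads_per_block
]

def run_compute_kernel(num_tb, threads_per_block, num_itrs, num_floats=1024):
    # allocate the operands on the GPU and hand raw device pointers to the launcher
    a = torch.rand(num_floats, device="cuda")
    b = torch.rand(num_floats, device="cuda")
    c = torch.empty(num_floats, device="cuda")
    _lib.launch_compute_kernel(
        ctypes.c_void_p(a.data_ptr()),
        ctypes.c_void_p(b.data_ptr()),
        ctypes.c_void_p(c.data_ptr()),
        ctypes.c_longlong(num_itrs),
        ctypes.c_int(num_tb),
        ctypes.c_int(threads_per_block),
    )
    torch.cuda.synchronize()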

I’m running on an H100 with CUDA 12.9 and driver 575.57.08. I’m seeing the following:

  • when the softmax runs alone (setting the num_tb arg to 0), its latency is roughly 65 microseconds
  • when I colocate with the compute kernel WITHOUT MPS and launch the compute kernel with 132 thread blocks (matching the 132 SMs on the H100), the softmax latency is still on the order of 65 microseconds. I checked with nsys that both kernels indeed overlap. Note that the GPU is in the default compute mode. I also printed the SM ID of each thread block of the compute kernel and can see that each of the 132 SMs hosts one thread block.
  • when I colocate with the compute kernel WITH MPS (just running the MPS server) but do NOT set the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable, the latency of the softmax suddenly increases to 100 ms (several orders of magnitude higher). I don’t understand what exactly is happening here. How is this case different from the previous one? Again, I see both kernels overlapping in the nsys trace, and each SM is hosting one thread block of the compute kernel.
  • when I repeat the previous scenario but this time set CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to 50, the latency of the softmax again decreases to roughly 65 microseconds (how the share gets applied per process is sketched right below). My assumption here was that the compute kernel saturates the 66 SMs it is given access to by using all 2048 threads on each of them. The softmax kernel thus has to run on the other 66 SMs and should not suffer from any interference. However, I still cannot explain the difference in behavior between the second and third scenarios.
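
For reference, a minimal sketch of how the per-process share gets applied (as far as I understand, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is read when the client process creates its CUDA context, so it has to be set before the first CUDA call; with the spawn start method each child process has its own environment):

import os

# must be in the environment before this process creates its CUDA context
os.environ["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = "50"

import torch  # importing torch alone does not create the context

x = torch.rand(1024, device="cuda")  # context is created here, with the 50% limit applied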

Does anyone have an idea of what is happening here? I don’t see how to explain the increase in latency. If it were due to interference, why wouldn’t I see that interference in the second scenario as well? Also, I wouldn’t expect such an explosion in latency purely from interference.

Any help is as always much appreciated!! Thanks a lot!

For your measurements, are you executing the softmax kernel once, or repeatedly in a loop? If this is measured for one-time execution, the massive increase in execution time is likely indicative of some one-time “cold start” overhead.

To confirm or refute this hypothesis, call the softmax kernel ten times in a row and check whether execution time stabilizes after five or so iterations, once it reaches “steady state”.
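
Something along these lines, a minimal sketch using CUDA-event timing as in your script, would separate the warm-up iterations from steady state:

import torch

logits = torch.rand(16, 128256, device="cuda")
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

times = []
for i in range(10):
    start.record()
    probs = logits.softmax(dim=-1, dtype=torch.float32)
    end.record()
    torch.cuda.synchronize()
    times.append(start.elapsed_time(end))  # milliseconds

# if only the first few iterations are slow, the overhead is a one-time "cold start" cost
print("first five:", times[:5])
print("last five: ", times[5:])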

General remark: GPUs are designed as throughput machines, and performance is therefore best measured in terms of throughput, e.g. the number of kernel invocations per second. GPUs are not well suited to latency-sensitive use cases. For latency-sensitive workloads, the platform of choice is usually high-frequency CPUs, such as AMD EPYC 9x75F processors.

Or, going down deeper, FPGAs or real-time processor cores running on bare metal or an RTOS.
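
To measure in the throughput framing mentioned above, one option is to time a batch of back-to-back invocations and report invocations per second instead of per-launch latency; a minimal sketch for the softmax case:

import time

import torch

logits = torch.rand(16, 128256, device="cuda")
logits.softmax(dim=-1, dtype=torch.float32)  # warm-up
torch.cuda.synchronize()

n = 1000
t0 = time.perf_counter()
for _ in range(n):
    probs = logits.softmax(dim=-1, dtype=torch.float32)
torch.cuda.synchronize()
print(f"{n / (time.perf_counter() - t0):.0f} softmax invocations per second")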

Apologies for the late reply. To confirm, I’m running the softmax kernel 100 times, as shown in the run_softmax function, and its latency is pretty stable across the 100 invocations.

An update on the above after some further investigation:

  • In the scenario without MPS, where I run both kernels and nsys shows them overlapping: this is a bit misleading on nsys’ part, since kernels from separate processes should not be able to truly overlap without MPS running. It is very likely due to the time-slicing nature of the scheduler, something that has been observed in other cases as well (see Robert’s response in https://stackoverflow.com/a/69585513). So what I assume ends up happening is that the long-running compute kernel (on the order of 8 s) is interrupted multiple times, once for each invocation of the softmax kernel, to context switch and run the softmax kernel instead. At that moment the softmax kernel is effectively running alone on the GPU, so there is no interference and its latency matches the baseline where it runs alone. Given its short latency of only 65 microseconds, I assume the softmax kernel itself is never preempted by the scheduler. So from nsys it looks like both kernels are running concurrently, but actually they are not.
  • What I am still a bit surprised about is the increase in latency in the 3rd scenario. I’m launching the compute kernel here with 132 thread blocks and 1024 threads per block. The softmax kernel is launched with 16 thread blocks and 1024 threads per block. Since MPS is running, I expect both kernels to actually overlap and 16 SMs to each host one thread block of each kernel. But can interference create such a significant slowdown, from 65 µs to 100 ms, or could it also be MPS-related? Because I also checked the following two scenarios, where I launch the compute kernel with only 66 thread blocks (the softmax remains at 16 thread blocks):
    • when I just launch the MPS server and don’t set an MPS share → softmax latency still ~100 ms
    • when I launch the MPS server and set an MPS share of 50% (i.e. 66 SMs) → softmax latency still ~100 ms

In both cases I would have assumed that the 82 thread blocks (compute + softmax) could each run on a separate SM, and as a result there should be no interference. So where is this slowdown coming from? Either it is interference, in which case I would be surprised by the scheduler’s behavior, or there is a reason I’m currently not aware of.
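
For what it’s worth, this is the back-of-envelope accounting behind that assumption (assuming 132 SMs and a 2048-resident-thread limit per SM on the H100; real occupancy also depends on per-block register and shared-memory usage):

# residency check under the assumed limits: at most two 1024-thread blocks per SM
NUM_SMS = 132
MAX_BLOCKS_PER_SM = 2048 // 1024

for name, total_blocks in [
    ("132 compute + 16 softmax blocks", 132 + 16),
    ("66 compute + 16 softmax blocks", 66 + 16),
]:
    capacity = NUM_SMS * MAX_BLOCKS_PER_SM
    verdict = "fits" if total_blocks <= capacity else "does not fit"
    print(f"{name}: {total_blocks} blocks vs. {capacity} resident-block slots -> {verdict}")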