[kernel switch latency] Switch latency between successive kernels

Hi, NVIDIA

I wrote a simple CUDA benchmark that launches two kinds of GPU kernels back to back. The code looks like this:

    for (int i = 0; i < 1000; ++i) {
        if (i % 2 == 0) {
            // launch kernel A
            KernelA_ReLU<T><<<grid_size_A, block_size_A>>>(..);
        } else {
            // launch kernel B
            KernelB_binAdd<T><<<grid_size_B, block_size_B>>>(..);
        }
    }

We find that the switch latency between two kernels is about 1-3 us on average. Here is a screenshot of the nsys profiling results: between two kernels, the latency is 2.88 us. On top of that, we cannot tell from nsys what is happening during this latency.

We wonder:

  1. What is the switch latency caused by? e.g. setting some GPU status register? invalidating some cache? What is happening during this latency?
  2. For a model such as an LLM, this latency is what matters, because such models launch many more kernels. Has NVIDIA optimized the switch latency?

Thank you.

Hi, @Robert_Crovella @Greg @njuffa

Sorry to interrupt.
Have you looked into the kernel switch latency? We wonder how to optimize this latency.
We work in the AI domain, and the currently popular LLM models, which launch many more kernels, are sensitive to this latency during online inference.

Thank you.

For repeated kernel launches, CUDA graphs can be used.

1 Like

There are three methods to reduce the dependent launch latency:

  1. Reduce the number of kernel parameters used by the dependent kernel.
  2. Use Programmatic Dependent Launch and Synchronization (CUDA C++ Programming Guide)
  3. Use CUDA Graphs (CUDA C++ Programming Guide)

(2), (3), and (2+3) can be used to completely eliminate the dependent launch latency.
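To make (3) concrete, here is a minimal sketch of driving the A/B pair through a captured CUDA graph. This is only a sketch: it assumes a CUDA 12.x toolkit, and kernelA/kernelB are empty stand-ins for the KernelA_ReLU/KernelB_binAdd kernels from the original post, whose arguments are not shown.

    // Sketch only: capture one A->B pair into a graph, instantiate it once,
    // then replay it. Assumes a CUDA 12.x toolkit; kernelA/kernelB are empty
    // stand-ins for the real kernels.
    #include <cuda_runtime.h>

    __global__ void kernelA(void) {}   // stand-in for KernelA_ReLU
    __global__ void kernelB(void) {}   // stand-in for KernelB_binAdd

    int main(void)
    {
        cudaStream_t s;
        cudaStreamCreate(&s);

        // During capture the kernels are recorded into the graph, not executed.
        cudaGraph_t graph;
        cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
        kernelA<<<1, 128, 0, s>>>();
        kernelB<<<1, 128, 0, s>>>();
        cudaStreamEndCapture(s, &graph);

        // Instantiate once; each cudaGraphLaunch then submits both kernels
        // with a single host-side launch.
        cudaGraphExec_t graphExec;
        cudaGraphInstantiate(&graphExec, graph, 0);
        for (int i = 0; i < 500; ++i) {
            cudaGraphLaunch(graphExec, s);
        }
        cudaStreamSynchronize(s);

        cudaGraphExecDestroy(graphExec);
        cudaGraphDestroy(graph);
        cudaStreamDestroy(s);
        return 0;
    }

If the kernel arguments change between replays, the instantiated graph can be refreshed with the graph update APIs (e.g. cudaGraphExecUpdate) instead of being re-captured.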

During this time, a number of operations have to occur:

  1. copy the kernel parameters to the stream constant bank
  2. update the state defining the grid launch
  3. launch the grid
  4. determine what SMs can accept new thread blocks
  5. launch thread blocks
  6. fetch the first instructions

If a cudaEvent/CUevent is used to time the kernel then the dependent work would start at (1) and this would be a smaller range. However, NSYS tries to show first instruction executed to memory barrier at the end of the grid so more of the dependent launch latency is exposed.

  • By reducing the dependent kernel input parameters you reduce step 1. This is only interesting if the kernel has a large number of parameters (e.g. > 1024B).
  • By using programmatic launch it is possible to hide (1-6) and have the two kernels overlap (see the sketch after this list).
  • By using CUDA graphs you can eliminate (1-2).
  • By using programmatic launch and CUDA graphs you can eliminate (1-6).
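As a concrete illustration of the programmatic-launch bullet, here is a minimal sketch of Programmatic Dependent Launch. It is only a sketch: it assumes a CUDA 11.8+ toolkit and a Hopper (sm_90) or newer GPU, compiled with -arch=sm_90, and primary/secondary are hypothetical stand-ins for two dependent kernels.

    // Sketch of Programmatic Dependent Launch (requires sm_90 or newer).
    #include <cuda_runtime.h>

    __global__ void primary(void)
    {
        // ... work whose results the secondary kernel depends on ...
        cudaTriggerProgrammaticLaunchCompletion();  // secondary may start launching now
        // ... work the secondary kernel does not depend on ...
    }

    __global__ void secondary(void)
    {
        // ... preamble that does not read the primary kernel's output ...
        cudaGridDependencySynchronize();  // wait until the primary grid's results are visible
        // ... work that consumes the primary kernel's output ...
    }

    int main(void)
    {
        cudaStream_t s;
        cudaStreamCreate(&s);

        primary<<<1, 128, 0, s>>>();

        // Opt the secondary launch into programmatic stream serialization so
        // its launch steps can overlap the tail of the primary kernel.
        cudaLaunchAttribute attr[1];
        attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
        attr[0].val.programmaticStreamSerializationAllowed = 1;

        cudaLaunchConfig_t cfg = {};
        cfg.gridDim = dim3(1);
        cfg.blockDim = dim3(128);
        cfg.stream = s;
        cfg.attrs = attr;
        cfg.numAttrs = 1;
        cudaLaunchKernelEx(&cfg, secondary);

        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
        return 0;
    }

The earlier the primary kernel calls cudaTriggerProgrammaticLaunchCompletion, the more of the secondary kernel's launch steps (1-6) can be hidden behind the primary kernel's remaining work.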
1 Like

Thank you @Greg @striker159 for the significant help.
We have a few more questions:

  1. What does ‘NSYS tries to show first instruction executed to memory barrier at the end of the grid so more of the dependent launch latency is exposed’ mean? Does it mean that NSYS measures the kernel start time after step 6 (fetch the first instructions), so that more of the launch latency shows up in the NSYS result?
  2. Why doesn’t the CUDA runtime automatically do steps 1 and 2 without CUDA Graphs?
  3. Is the kernel switch latency a performance metric inside NVIDIA? We find that the kernel switch latency regressed from about 1 us to about 2.6 us on average on an A100 (PCIe, 40 GB) after upgrading the driver and CUDA toolkit.

Thank you.

Not to get ahead of Greg, who is an expert in these matters and to whose assessment I defer: that would appear to be the fundamental CUDA kernel launch overhead.

Most of that is the time for the required PCIe transactions, so it is hardware related; the minor part is software overhead in the driver and the CUDA runtime. This number has come down slightly over the years: launch overhead was about 5 microseconds in 2012 with the Kepler architecture using PCIe gen 3. The software portion of the overhead scales approximately linearly with single-thread CPU performance.

One could speculate that the basic kernel launch overhead will shrink a bit once GPUs adopt PCIe gen 5 interfaces, with the caveat that improvements with increasing PCIe versions are mostly focused on improving throughput (usually by roughly a factor of 2), with only minor improvements in latency.

My understanding of how CUDA Graphs helps is that it “batches up” kernel launches, thereby largely eliminating the overhead for launching each kernel individually. The traditional strategy to avoid negative performance impact at application level from kernel launch overhead has been to avoid kernels with extremely short run times. More recently (in 2018, I think) this was augmented by the introduction of CUDA Graphs.

2 Likes

The question is specifically about kernel launch latency, and @Greg gave an overview and details about it.
However, I think it should be mentioned at least once in the thread that there are a number of alternative techniques in use:

  • kernel fusion - merging several serial kernel calls into one hard-wired kernel (a small sketch follows below)
  • dynamic parallelism - invoking kernels or graphs from within CUDA device code (not necessarily faster than launching from host code in the general case, when the kernel order and parameters are known in host code)
  • resident kernels - early on more of a hack and rather exotic, now better supported with cooperative kernel launches; the kernels keep running on the GPU and synchronize on new work packages
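For the kernel-fusion point, a hypothetical sketch (kernel name and signature are illustrative, not taken from this thread) that collapses a ReLU kernel followed by an elementwise add into a single kernel:

    // Hypothetical fused kernel: applies ReLU to `a` and adds `b` in one pass,
    // replacing two back-to-back launches with a single launch.
    template <typename T>
    __global__ void fused_relu_add(const T* a, const T* b, T* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            T r = a[i] > T(0) ? a[i] : T(0);  // ReLU
            out[i] = r + b[i];                // elementwise add
        }
    }

Besides removing one launch, fusion of this kind also avoids the round trip of the intermediate result through global memory.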
2 Likes
  1. What does ‘NSYS tries to show first instruction executed to memory barrier at the end of the grid so more of the dependent launch latency is exposed’ mean? Does it mean that NSYS measures the kernel start time after step 6 (fetch the first instructions), so that more of the launch latency shows up in the NSYS result?

There is a time bar on the timeline, or a stat, that shows the duration of a kernel. Its two endpoints are:

  1. Start Timestamp - Any time before Step 6
  2. End Timestamp - Any time after completion of the grid and before launch of dependent work.

If you were to capture duration using a cudaEvent/CUevent then it would be from before Step 1 to after the Grid MEMBAR. However, it may also include time after the MEMBAR when the GPU is working on a different stream and has not switched back to the current stream.

NSYS tries to get the most accurate duration to help users optimize the kernel code. In doing so, the duration does not include the overhead of launching the kernel, i.e. Steps 1-6. The first timestamp is just after Step 6.

  2. Why doesn’t the CUDA runtime automatically do steps 1 and 2 without CUDA Graphs?

Most optimizations are a trade-off. Given that trade-off, CUDA introduced CUDA Graphs and Programmatic Dependent Launch to help get the best performance possible.

  3. Is the kernel switch latency a performance metric inside NVIDIA? We find that the kernel switch latency regressed from about 1 us to about 2.6 us on average on an A100 (PCIe, 40 GB) after upgrading the driver and CUDA toolkit.

I highly advise you to switch your term from “kernel switch latency” to “dependent kernel launch latency”. Kernel switch latency is a very different issue. For example, kernel switch latency can occur between independent launches and is not measurable in the current tools.

Dependent launch latency and launch latency are extremely important metrics for NVIDIA. When writing a tool such as CUPTI/NSYS/NCU there are trade-offs. Observing grid launches, start timestamps, and end timestamps can (a) add overhead, and (b) introduce false positives. I highly recommend you write a micro-benchmark to test launch latency. CUPTI/NSYS add overhead. You can determine how much overhead by writing micro-benchmarks and comparing the result with the tool and without the tool.

1 Like

Thank you all. Very useful suggestions.

Hi, @njuffa

Most of that is the time for the required PCIe transactions, so it is hardware related; the minor part is software overhead in the driver and the CUDA runtime.

Agreed. Because of step 1 (copy kernel parameters), most of the latency comes from hardware PCIe transactions.

Hi, @Greg
Thank you for the help.

If you were to capture duration using a cudaEvent/CUevent then it would be from before Step 1 to after the Grid MEMBAR. However, it may also include time after the MEMBAR when the GPU is working on a different stream and has not switched back to the current stream.

In our microbenchmark, we indeed use cudaEvent to measure the kernel execution duration. Do you mean it could include the time of step 1 (copy kernel parameters) and the grid MEMBAR? What is the grid MEMBAR used for?

I highly recommend you write a micro-benchmark to test launch latency.

Yes. We measure the dependent kernel launch latency with a CUDA microbenchmark. Here is the relevant piece of code:

    cudaEvent_t start[round], stop[round];
    for (int i = 0; i < round; ++i) {
        CUDA_CHECK(cudaEventCreate(&start[i]));
        CUDA_CHECK(cudaEventCreate(&stop[i]));
    }

    for (int i = 0; i < round; ++i) {
        CUDA_CHECK(cudaEventRecord(start[i]));
        if (i % 2 == 0) {
            // launch kernel A
            KernelA_ReLU<T><<<grid_size_A, block_size_A>>>(data_A, result_A);
        } else {
            // launch kernel B
            KernelB_binAdd<T><<<grid_size_B, block_size_B>>>(data_B_src1, data_B_src2, result_B);
        }
        CUDA_CHECK(cudaEventRecord(stop[i]));
    }

Kernel A and kernel B have no data dependency. We use each pair of CUDA events to get each kernel’s execution time, and the first and last events to get the end-to-end (E2E) time. From these we compute the average dependent kernel launch latency. Here is an illustration:


Does it make sense to measure dependent kernel launch latency this way? Here are the results we measured on an A100 and an A4000, both of which are Ampere microarchitecture. The time unit is us.
[screenshot of the measured latencies]

Because a kernel launch has certain synchronization requirements, kernels launched into the same stream are always quasi-dependent on their predecessor, so the simplest way to measure the minimum launch latency is to issue null kernels (kernels with no arguments that perform no work) in a loop. But your method looks equally valid.

The measured launch overhead looks entirely plausible to me and suggests (1) the use of a Linux platform and (2) the use of a relatively highly clocked CPU; in other words, approximately optimal given current HW and SW.

Hi, @njuffa
Thank you for the help. I will measure the kernel launch latency with a null kernel to avoid the PCIe transaction.

BTW, here is the microbenchmark code we use to measure the latency. Does recording an event (cudaEventRecord) increase the launch latency, because it is called between the two kernels?

for (int i = 0; i < round; ++i) {
    CUDA_CHECK(cudaEventRecord(start[i])); // record event increase latency?
    if (i % 2 == 0) {
        KernelA_ReLU<<<grid_size_A, block_size_A>>>(...);
    } else {
        KernelB_binAdd<<<grid_size_B, block_size_B>>>(...);
    }
    CUDA_CHECK(cudaEventRecord(stop[i])); // record event increase latency?
}

The overhead of null kernels is only slightly less than that of kernels with a reasonable number of arguments. You can easily test this yourself by using synthetic kernels with 0, …, N arguments and measuring the launch overhead.

It is not necessary to use CUDA events for measuring kernel launch overhead, and in fact I have never done so myself. Instead I have always used a high resolution host timer for that. Something along these lines:

#include <stdio.h>
#include <stdlib.h>

#define REPEATS       (2)
#define LAUNCHES      (1000000)
#define USEC_PER_SEC  (1000000)

// Timer with microsecond resolution
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

__global__ void kernel (void) {}

int main (void)
{
    double start, stop, elapsed;
    int i;

    printf ("\nlaunch overhead WITHOUT per-kernel synchronization\n");
    i = REPEATS;
    do {
        cudaDeviceSynchronize(); // wait until any pending activities complete
        start = second();
        for (int j = 0; j < LAUNCHES; j++) {
            kernel<<<1,1>>>();
        }
        cudaDeviceSynchronize();
        stop = second();
        elapsed = stop - start;
        i--;
        if (!i) printf ("%d launches. Elapsed = %.6f seconds: %f usec/launch\n",
                        LAUNCHES, elapsed, (elapsed / LAUNCHES) * USEC_PER_SEC);
    } while (i);
    printf ("\nlaunch overhead WITH per-kernel synchronization\n");
    i = REPEATS;
    do {
        cudaDeviceSynchronize(); // wait until any pending activities complete
        start = second();
        for (int j = 0; j < LAUNCHES; j++) {
            kernel<<<1,1>>>();
            cudaDeviceSynchronize();
        }
        stop = second();
        elapsed = stop - start;
        i--;
        if (!i) printf ("%d launches. Elapsed = %.6f seconds: %f usec/launch\n",
                        LAUNCHES, elapsed, (elapsed / LAUNCHES) * USEC_PER_SEC);
    } while (i);
    return EXIT_SUCCESS;
}
1 Like

Hi, @njuffa

The overhead of null kernels is only slightly less than that of kernels with a reasonable number of arguments.

I wonder why the overhead of null kernels is only slightly less than that of kernels with arguments. Copying arguments to the device kernel will trigger a PCIe transaction, right?

It is not necessary to use CUDA events for measuring kernel launch overhead, and in fact I have never done so myself. Instead I have always used a high resolution host timer for that.

It makes perfect sense to use an empty CUDA kernel and a high-resolution host timer to measure the kernel launch overhead. I will try to measure it again. Does NVIDIA have any data for the kernel launch overhead on A100 or H100?

Thank you. :)

From my limited understanding, a reasonable number of kernel arguments increases the size of the transaction queued into the command buffer, but does not add any additional transactions. If you start with a null kernel, what I would expect to see in terms of kernel launch overhead would be a shallow-sloped linear increase as the number of kernel arguments increases.
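One way to see that slope is to time launches of synthetic kernels that differ only in their argument count. A sketch of that idea, using std::chrono for brevity instead of the gettimeofday wrapper above (kernel names and argument counts are arbitrary):

    // Sketch: compare launch overhead for kernels with 0 vs. 8 int arguments.
    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    __global__ void kernel0(void) {}
    __global__ void kernel8(int, int, int, int, int, int, int, int) {}

    template <typename F>
    double usec_per_launch(F launch, int n)
    {
        cudaDeviceSynchronize();                       // drain pending work
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i) launch();
        cudaDeviceSynchronize();                       // wait for all launches to finish
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::micro>(t1 - t0).count() / n;
    }

    int main(void)
    {
        const int n = 100000;
        printf("0 args: %f usec/launch\n",
               usec_per_launch([] { kernel0<<<1, 1>>>(); }, n));
        printf("8 args: %f usec/launch\n",
               usec_per_launch([] { kernel8<<<1, 1>>>(0, 1, 2, 3, 4, 5, 6, 7); }, n));
        return 0;
    }

With only a handful of scalar arguments the difference should be small; a noticeable slope would only be expected as the parameter block grows.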

I am the wrong person to answer that. I retired from NVIDIA in July of 2014. Historically, NVIDIA has not published such data, as it is easy enough for individual programmers to measure.

1 Like

Thank you for help. @njuffa

BTW, we are confused about cudaEventElapsedTime. This API computes the host-side duration between two CUDA events, right? What time is recorded for these CUDA events? Is it the host time at which the event is signaled? For example, once the preceding kernels have finished and the following CUDA event is signaled, the current host time is recorded into that event. Am I understanding this right?

cudaEventElapsedTime(float* ms, cudaEvent_t start, cudaEvent_t end)
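For context, here is how we plan to read the durations back (a sketch reusing round and the start[i]/stop[i] events from the loop above):

    // Sketch: read back per-kernel and end-to-end durations after all work
    // has completed; start[i]/stop[i] are the events recorded in the loop above.
    CUDA_CHECK(cudaDeviceSynchronize());  // make sure every recorded event has been signaled
    for (int i = 0; i < round; ++i) {
        float ms = 0.0f;
        CUDA_CHECK(cudaEventElapsedTime(&ms, start[i], stop[i]));  // elapsed time between the two events, in ms
    }
    float total_ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&total_ms, start[0], stop[round - 1]));  // E2E time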

I think my microbenchmark may be wrong because it cannot isolate the kernel launch overhead. Here is an illustration: the orange bar is the launch overhead (latency) we want to measure, the green boxes are the CUDA events we record into the stream, and the blue boxes are the kernels we have launched.
We find that the launch overhead is contained between two CUDA events. When cudaEventElapsedTime is called with event start0 and event stop0, does the returned duration contain not only kernel0’s execution time but also kernel0’s launch latency? Or does cudaEventElapsedTime compute only the pure kernel execution time?

Thank you :)