Any way to measure the latency of a kernel launch?

I am making a benchmarking program to automate the collection of timings data for individual kernels in a larger library / software suite I am building. Making this additional test program has provided the extra benefit of pushing me to re-think the setup and come at things from another perspective, which has uncovered a bug or two.

But, on to the question… I designed my kernels around “work units” such that each thread block will take a work unit, do it, then reach for the next one in an asynchronous fashion. Over a large workload, the blocks should therefore all finish about as quickly as possible. This approach has worked well in another program I worked on, and I am adding some polish to the system. One would hypothesize that, so long as the card is evenly filled by X work units, providing 2X work units would require about twice as long to compute, plus some launch latency. The card is being bombarded with one workload after another, but overall (so far) the workloads are not driving up the temperature of the card, likely because of the length of the longer-running kernel and the fact that its work units are somewhat sparse in this test.

When I do this experiment with each of two kernels, I find that the timings are indeed linear in the workload, but with some striking intercepts. I am running these experiments with a number of work units that is a direct multiple of the number of streaming multiprocessors, and thus of the number of blocks in the launch grid; in fact each work unit calls for a replica of the same evaluations, so the load is very well balanced. With the faster kernel, the trendline suggests a launch latency of about 11 microseconds. With the slower kernel, the trendline suggests a launch latency of 0 or perhaps even -1 microseconds (so, zero). The first result seems within reason, though perhaps a bit higher than I usually hear kernel launch latency quoted at. The second result seems odd: I’d expect at least 5, especially given that the slower kernel involves much more detailed code (though the reason it is slower is not that it does more math; it actually does much less math but involves a lot more atomicAdd() operations).

I am running these calculations on an A40, the top-end visualization card, but I believe I am getting MAD returns off of the floating-point arithmetic two-fer that the GA102 series offers.

Can anyone comment here: does kernel launch latency indeed scale with the complexity of the kernel? Is 11 microseconds a launch latency one might expect for a relatively simple kernel requiring 40 registers per thread and launched with five 256-thread blocks per SM? Is it plausible that a very complex kernel (about 1600 lines in all) with 57-60 registers per thread, launched as one 1024-thread block per SM, might have a very low launch latency?

Cheers!

(Edit: I am looking at several other systems of similar sizes, and the results remain more or less the same–7 microseconds’ estimated latency in the other systems for the fast kernel, and 0 to “negative 1” microseconds’ estimated latency in the slower kernel.)

Kernel launch overhead is frequently cited as 5 microseconds. That figure is based on measurements using a wave of null kernels, that is, back-to-back launches of an empty kernel that does not do anything, i.e. exits immediately. One finds that there is a hard limit of around 200,000 such launches per second.

This has been well reproducible for many years, and is basically a function of the hardware mechanisms involved. I have seen some indications that this overhead has reduced slightly on the latest PCIe4-based GPUs, to about 4 microseconds. My understanding of the PCIe transactions is limited, but as best I know a kernel launch requires at least two transactions: (1) the host sending a kernel launch command to the GPU, and (2) the GPU sending an acknowledgement back to the host.

I believe you are on Linux? On Windows, with the default WDDM driver, the launch overhead is larger. Because traversing the WDDM software stack is so expensive, the CUDA driver batches up launches to reduce the average launch overhead, but when I last measured on Windows 10 a couple of years back, it still averaged 10+ microseconds.

Real kernels require more work than null kernels, as various additional control information has to first be stuffed into, then taken out of, the command queue (a software buffer on the host side that is drained via PCIe into the GPU’s hardware command queues), and configuring the GPU hardware for the launch may be more or less involved. The GPU command queues are processed at finite speed, and processing more control data takes additional time. I have not measured it recently and my memory is hazy, but I seem to recall this step takes single-digit microseconds.

Note that there could be issues with measurement frameworks. Maybe measurements are recorded with too coarse a resolution and microsecond delays cannot reliably be accounted for. Or there could be an issue of calibration, for example, when the basic 5 microsecond launch overhead is already subtracted out from the measured time (when measured from the host side) or invisible (when measured from the GPU side).


This is very helpful. Norbert, how many likes are you up to now?

Yes, I meant to say overhead, but others have said latency in conversations around the office and in various email threads. In any event, I am grateful to know the distinction between the mere fact of launching a kernel and the additional work of the driver (or whatever mechanism) setting it up to actually perform the work. This all makes sense, and it leads me to expect 5-10 microseconds per kernel launch.

Here is the way in which I am measuring the kernel execution time:

      timer->assignTime(0);

      // Test the valence kernel
      for (int i = 0; i < 1000; i++) {
        ctrl.step += 1;
        launchValenceDp(poly_vk, poly_rk, &ctrl, &poly_psw, &scw, &gmem_rval, eval_frc, eval_nrg,
                        purpose, launcher);
      }
      cudaDeviceSynchronize();
      timer->assignTime(sys_val_time);

There is a lot to explain there, but various objects are created on the CPU with GPU images of their data arrays, then I obtain “abstracts” of each object that contain pointers to the GPU memory. So ctrl is the abstract of mmctrl, a MolecularMechanicsControl object that really just contains counters. I am explicitly updating the ctrl.step field, which is an int within MolecularMechanicsControl and its abstract, and re-uploading this data as part of the kernel arguments (just a convenient way to tell the kernel that the step number has advanced, which nudges it to use a different progress counter and periodically refresh a batch of progress counters, depending on the kernel). The kernel launch is enclosed within the launchValenceDp function, and all it does is evaluate a couple of switches to get the appropriate launch parameters from the KernelManager object launcher and then fire off the corresponding variant of the kernel based on the flags: whether to evaluate forces, whether to evaluate energies, and what to do with the particles at the end of the calculation (move them and wipe the global force array, or store the forces in the correct global array so that the particles can be moved later).

But, the point is, I run the kernel 1000 times, then do a device synchronization, in between two time assignments to my StopWatch object timer. The 0 category is “general stuff” while sys_val_time is the index I store for the category relating to this particular kernel test. I can add as many categories to that object as I want, and each can emit an integer for the most rapid access. The wall clock timer underlying StopWatch is the familiar <ctime> with microsecond accuracy, but again, I’m running the kernel 1000x, so the precision of the per-launch timing should be quite high.

Since it is better to run code than engage in discussion based on memorized data, here is my program that measures per-kernel launch overhead with and without synchronization after every kernel. The time without synchronization is lower than what I remembered and expected.

It would be interesting to see the results from your system. You could perform some targeted experiments by trying different grid and block configurations, different number of kernel arguments, use of dynamic shared memory, effects of using managed memory, etc.

launch overhead WITHOUT per-kernel synchronization
1000000 launches. Elapsed = 3.056754 seconds: 3.056754 usec/launch

launch overhead WITH per-kernel synchronization
1000000 launches. Elapsed = 49.565085 seconds: 49.565085 usec/launch

The quick & dirty program I just wrote and that produced the above output:

#include <stdio.h>
#include <stdlib.h>

#define REPEATS       (2)
#define LAUNCHES      (1000000)
#define USEC_PER_SEC  (1000000)

// Timer with microsecond resolution
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

__global__ void kernel (void) {}

int main (void)
{
    double start, stop, elapsed;
    int i;

    printf ("\nlaunch overhead WITHOUT per-kernel synchronization\n");
    i = REPEATS;
    do {
        cudaDeviceSynchronize(); // wait until any pending activities complete
        start = second();
        for (int j = 0; j < LAUNCHES; j++) {
            kernel<<<1,1>>>();
        }
        cudaDeviceSynchronize();
        stop = second();
        elapsed = stop - start;
        i--;
        if (!i) printf ("%d launches. Elapsed = %.6f seconds: %f usec/launch\n",
                        LAUNCHES, elapsed, (elapsed / LAUNCHES) * USEC_PER_SEC);
    } while (i);
    printf ("\nlaunch overhead WITH per-kernel synchronization\n");
    i = REPEATS;
    do {
        cudaDeviceSynchronize(); // wait until any pending activities complete
        start = second();
        for (int j = 0; j < LAUNCHES; j++) {
            kernel<<<1,1>>>();
            cudaDeviceSynchronize();
        }
        stop = second();
        elapsed = stop - start;
        i--;
        if (!i) printf ("%d launches. Elapsed = %.6f seconds: %f usec/launch\n",
                        LAUNCHES, elapsed, (elapsed / LAUNCHES) * USEC_PER_SEC);
    } while (i);
    return EXIT_SUCCESS;
}

I do have a number of kernel arguments, but together they might add up to 1kB of data, which I think would be put into the __constant__ cache in the time it takes to launch the kernel (essentially free, is what I’ve been told). Is it your understanding that dynamic __shared__ memory drives up the launch setup time? What about launching many smaller blocks (5 x 256 per SM) as opposed to a few larger ones (1 x 1024 per SM)?

Clearly my recollection of the basics did not match the current reality, and that would be even more true for details, which I last examined some ten years ago. If you run some targeted experiments (micro-benchmarks) you should have all the answers you need within a couple of hours.


I will add these features to my main kernel benchmarking program, as a means of getting baseline results and then having them at hand to post alongside other results. The features will then be there to run on any GPU of interest, which may even inform purchasing decisions. Excellent idea!

Edit: on the A40, I am seeing a null kernel launch overhead of 2.75 microseconds. On a GTX 1080-Ti, remarkably, the overhead is a bit less, at 2.5 microseconds. Both results seem to be independent of the number of blocks launched, however the kernel arrives at its total of 1024 threads per SM. When adding a “healthy” amount of kernel arguments, in total a little over 1kB like I think I’ve been passing to some of my real kernels, there is a 25% bump in the kernel launch overhead for the GTX 1080-Ti (the time cost goes up to 3.1 microseconds), but a much less significant bump on the A40 (2.9 microseconds, up from 2.75). So these are quite measurable things, but so far, it’s looking pretty good. And, quite amazing results on an RTX 2080-Ti: 1.8 microseconds to launch a null kernel, 2.2 to launch an argument-loaded one.

I’m trying to test the latency of GMEM access by having threads read from one array and write to another. Doing this with just one integer per thread isn’t producing any signal–it takes about as long to launch (and complete) as the null kernel.

Thanks for the data. It is really good news that kernel launch overhead has been reduced this much compared to older platforms.

Do you have any PCIe Gen4 setup that you could test? Usually what happens with new generations of interfaces is that throughput goes up, but basic latencies remain more or less the same. DRAM is just one example. As the old quip goes: Money can buy bandwidth, but latency is forever.

But I thought I had read something that said that the writer saw reduced launch latency when using a PCIe Gen4 platform. Now I am wondering whether I actually read that or whether it is just a false memory.

The initial launch latency will be impacted by parameter size. You can test simply by modifying Norbert’s kernel to have 0-4KiB of parameters. Expect the impact to be size/4 clock cycles.

The shared memory size has a very small impact on the per-SM, per-thread-block launch latency. This will be counted in a small number of nanoseconds and, after the first wave, is not likely to impact you.

The size of the kernel in terms of instructions does not impact the launch latency but it may impact the duration of the kernel due to instruction cache misses.

Launching warps is faster than launching thread blocks. Please note that in your comment above you compared 5x256 vs. 1x1024. The added occupancy may improve performance. In addition, if you have __syncthreads() in a large thread block, then you may fully stall the warp schedulers. If you use __syncthreads() heavily, then I recommend more than one thread block per SM. You can determine whether this is a problem by looking at the warp stall reason counters in the profiler.


This is very helpful, @Greg. In fact, it’s pretty much what I needed to know. In some of my kernels, I do make use of __syncthreads() on large blocks, so I’ll take a look at what it does to use two or four blocks of 512, 256 threads rather than one of 1024. In principle, I could break up the kernel more finely in general, and probably lose no more than 1% efficiency in the total arithmetic workload that needs to be performed (the system is broken up along various fault lines, and there are halo regions of additional particles around the group of particles that each work unit serves, so larger work units = more volume relative to the halo regions = more efficient in theory). But if the block-wide synchronization is more than a couple of percent, then reducing the work unit size is certainly an option. At present, my biggest impediment is that the blocks are WAY TOO BIG for the tiny systems that constitute one major use case, and the use case that would need the biggest work units is rather theoretical, so there’s a definite case for just making those blocks smaller.

Once upon a time (Fermi) we had GPUs with SMs that had a max of 1536 threads. After Fermi, until cc8.6 (Ampere GPUs that aren’t A100), all GPUs had either 1024 or 2048 max threads per SM. So there were no first-order concerns around thread blocks of 1024 threads and occupancy.

With the advent of newer GPUs with max of 1536 threads per SM, I generally suggest that people may want to consider this in threadblock sizing. 1024 might be a less than optimal choice for some GPUs, right out of the gate, without any further analysis.

Yes, I noticed that about the RTX-3080 and RTX-3090. The A40 is still 2048, to my recollection, and we do compute on Enterprise cards at my company, although the GPUs we have in desktops around the office are an attractive resource if we could get people to use them judiciously. The 1024-thread block is running near 64 registers per thread and it’s firing off an awful lot of atomic operations to __shared__ memory (which is better than to global addresses, but still not much fun). Beyond that, I’m happy that I seem to be making much better use of global memory bandwidth due to the way I organized the input data. An earlier incarnation of this kernel hit a memory bandwidth wall around 384 threads / SM (three blocks of 128 threads each). The fact that this one seems to be ducking the memory bandwidth bottleneck at 1024 threads per SM makes me smug, but there are other constraints keeping me from going (much) over 1024 threads per SM here.

A40 is a cc8.6 GPU (based on GA102) and it has a maximum of 1536 threads per SM.

