Launching many small kernels is 10x slower than a single kernel

I have a CFD application that we are trying to accelerate on GPUs. I have written a solver that performs well (0.2 ns per cell update on an H100 for a full 3D finite-volume hydro Riemann solver).

When I run a synthetic problem on a 512^3 unigrid I can fill up the GPU and get excellent performance of 0.2 ns per cell update. The solver uses a marching plane, and the domain is tiled into 8x8 tiles of arbitrary height. This makes it possible to keep all scratch variables in the shared memory of the SM.

When I instead cut my grid into 32^3 patches of 16^3 cells, so that memory is organised as (N,N,N,Nvar,Npatch), where N is 16+4 (4 for 2+2 boundary zones) and Npatch is 32^3, and I launch it as a single kernel with one of the launch dimensions being Npatch, I again get the same performance.

The problem appears when I try to launch one kernel per patch. Even when I multiplex the launches over 128 streams and make them all asynchronous, the performance drops by more than a factor of 10, to 3 ns per cell update.
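Sketched in CUDA C++ with made-up names and sizes (patch_update and the constants below are illustrative placeholders, not my actual solver, which is written in CUDA Fortran), the two launch strategies I am comparing look roughly like this:

```cpp
#include <cuda_runtime.h>

constexpr int N      = 16;              // interior cells per side
constexpr int NG     = 2;               // ghost zones per side
constexpr int NTOT   = N + 2 * NG;      // 20 cells per side including ghosts
constexpr int NVAR   = 5;               // hydro variables
constexpr int NPATCH = 32 * 32 * 32;    // 32^3 patches covering 512^3

// Placeholder patch-update kernel: each block works an 8x8 marching-plane
// tile; the patch is selected from blockIdx.z plus an offset.
__global__ void patch_update(float* u, int patch0)
{
    __shared__ float scratch[8][8][NVAR];   // per-tile scratch, size is a placeholder
    int patch = patch0 + blockIdx.z;        // which 16^3 patch this block works on
    float* up = u + (size_t)patch * NTOT * NTOT * NTOT * NVAR;
    // ... march an 8x8 plane through the patch, Riemann solve (omitted) ...
    (void)scratch; (void)up;
}

int main()
{
    float* u = nullptr;
    cudaMalloc(&u, sizeof(float) * (size_t)NTOT * NTOT * NTOT * NVAR * NPATCH);

    dim3 block(8, 8);                       // 8x8 tile, marching in the third dimension

    // (a) One kernel over all patches: gridDim.z enumerates the patches.
    patch_update<<<dim3(2, 2, NPATCH), block>>>(u, 0);
    cudaDeviceSynchronize();

    // (b) One kernel per patch, round-robined over 128 streams.
    const int NSTREAMS = 128;
    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&streams[i]);
    for (int p = 0; p < NPATCH; ++p)
        patch_update<<<dim3(2, 2, 1), block, 0, streams[p % NSTREAMS]>>>(u, p);
    cudaDeviceSynchronize();

    for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(streams[i]);
    cudaFree(u);
    return 0;
}
```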

My estimate is that a single 16^3 patch is executed on 4 SMs and costs approximately 25 microseconds to compute. Is the cost of launching a kernel really 10 times that (i.e. 250 microseconds) even when done completely asynchronously, and is it possible to reduce it?

The backdrop is that the production application has an adaptive mesh of patches that is executed fully multithreaded, with the possibility of detaching CPU threads, so the natural workload size is a single 16^3 patch, and a few CPU threads can launch thousands of patches per second.

Typically I would expect the kernel launch overhead to be in the 25 us range, not the 250 us range. It can be influenced slightly by grid size, parameter pack size, etc., but in my experience none of these would push it to 10x that. In some cases you can measure a kernel launch overhead as small as perhaps 5 us. In my experience Windows WDDM vs. Linux (or TCC) can also be a factor - Windows being somewhat longer, but not 10x.

Other factors may be affecting performance. For example, if shared memory is being used efficiently at the larger scale, you might not get as much benefit at the smaller scale.

A typical recommendation to reduce kernel launch impact for smaller kernels is the use of CUDA Graphs.
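As a rough illustration of the idea (the kernel name, launch configuration, and bunch size below are placeholders, not your code): a batch of small launches can be captured once and then replayed with a single graph launch, so the per-kernel CPU-side submission cost is paid only once per replay.

```cpp
#include <cuda_runtime.h>

__global__ void patch_update(float* u, int patch)   // placeholder kernel
{
    (void)u; (void)patch;
}

int main()
{
    float* u = nullptr;
    cudaMalloc(&u, 1 << 20);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;

    // Record a batch of small launches instead of executing them directly.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int p = 0; p < 256; ++p)                    // one "bunch" of patches
        patch_update<<<dim3(2, 2), dim3(8, 8), 0, stream>>>(u, p);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

    // Replay the whole batch with a single launch call.
    cudaGraphLaunch(graph_exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(u);
    return 0;
}
```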

As a general rule, though, taking a larger kernel and breaking it into a bunch of smaller kernels is usually not a good idea if the smaller kernels are too small to fill the GPU entirely.

Also, a profiler may help to shed light on the differences.

Thanks for the super quick answer. It all runs on Linux. The results are from a synthetic mini-app.

The production code is a multi-threaded, task-driven environment that updates patches dynamically; something like this:

```
!$omp task
   pre update code
   transfer of patch variables from host to GPU
   KERNEL LAUNCH
   transfer of patch variables from GPU to host
   !$omp taskyield
   cudasync
   post update code
!$omp end task
```
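In CUDA terms, each such task boils down to roughly the following sequence on its own stream. This is a CUDA C++ sketch with placeholder names and sizes, not the actual production (CUDA Fortran) code; note that the host buffer has to be pinned for the copies to be truly asynchronous.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

__global__ void patch_update(float* u) { (void)u; }   // placeholder kernel

// One task's worth of work: asynchronous H2D copy, kernel, D2H copy,
// then a synchronization point corresponding to "cudasync" above.
void update_patch(float* h_patch, float* d_patch, size_t bytes,
                  cudaStream_t stream)
{
    cudaMemcpyAsync(d_patch, h_patch, bytes, cudaMemcpyHostToDevice, stream);
    patch_update<<<dim3(2, 2), dim3(8, 8), 0, stream>>>(d_patch);
    cudaMemcpyAsync(h_patch, d_patch, bytes, cudaMemcpyDeviceToHost, stream);
    // ... the task can yield here and do other CPU work ...
    cudaStreamSynchronize(stream);   // wait for this patch only, not the whole GPU
}

int main()
{
    const size_t bytes = 20 * 20 * 20 * 5 * sizeof(float);  // one 16^3 patch + ghosts
    float *h_patch, *d_patch;
    cudaMallocHost(&h_patch, bytes);   // pinned host memory, so copies can overlap
    cudaMalloc(&d_patch, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    update_patch(h_patch, d_patch, bytes, stream);

    cudaStreamDestroy(stream);
    cudaFree(d_patch);
    cudaFreeHost(h_patch);
    return 0;
}
```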

Redoing my math and correcting a few mistakes, I actually find something close to 5 us for the “execution time” of each kernel when launching the 16^3 patches, which may then be completely dominated by the kernel launch. Is there any way to decrease the launch time, or do I need to construct larger kernels?

If you are getting 5us (microseconds) per kernel, you are doing “well” in my opinion. I don’t mean “well” in terms of overall performance, I mean “well” in terms of the lower bound of what is achievable for measuring the duration of a single kernel - even an empty kernel.
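If you want to see that floor on your own system, a crude way is to time a large number of back-to-back empty-kernel launches and divide. This conflates submission cost and queue throughput, but it gives the few-microsecond order of magnitude. An untested sketch:

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void empty_kernel() {}

int main()
{
    const int NLAUNCH = 10000;
    empty_kernel<<<1, 1>>>();          // warm-up (context creation, module load)
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < NLAUNCH; ++i)
        empty_kernel<<<1, 1>>>();      // kernels do nothing, so time ~= launch cost
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    std::printf("average per launch: %.2f us\n", us / NLAUNCH);
    return 0;
}
```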

As a practical matter, it is hard to improve on that. You could try CUDA graphs, although I don’t know if it will fit well with your dynamic work creation scheme. It might.

Other than that I have no suggestions to improve performance of such small kernels (you already mentioned using multiple streams, etc.). A kernel that only occupies a few SMs is quite small on a GPU that has 40 or more SMs.

I haven’t looked into what CUDA graphs look like in CUDA Fortran. If you get to that point, you may get better help on this forum, where the majority of the CUDA Fortran questions are.

Is there any way to combine several of those kernels?

What is the limiting factor here?

  • thread synchronization as they are on different host threads
  • latency - the host cannot wait for other ready patches
  • different launch configurations between patches
  • something else?

Constructing larger kernels is highly recommended. IMHO, ideally kernel execution times should range from around 1 to 200 milliseconds. Depending on use case, deviations from this range will be common.

The launch overhead is largely a function of hardware, in particular the PCIe interconnect and associated PCIe interfaces being used. There is a contribution from software in the form of driver execution time. The software overhead is dwarfed by the hardware overhead except when WDDM drivers are used. Generally speaking, software overhead is strongly inversely correlated with the single-thread performance of the host CPU. I recommend CPUs with a base frequency of >= 3.5 GHz for best results.

While one might expect newer PCIe versions to provide an advantage in terms of reducing the hardware overhead, new PCIe generations are largely focused on increasing bandwidth and bring only modest reductions in latency. However, from what I have seen over the past 15 years, the minimum achievable kernel launch overhead has decreased from about 5 microseconds to around 2.5 to 3 microseconds on the latest hardware (using components that support PCIe5).

Thanks to all for good suggestions and insight.

For the curious, the code we are working to bring to GPUs is called DISPATCH. This discussion is about a mini-app used to investigate different strategies for the larger code.

I clearly need kernels that take longer to execute. This can be accomplished by:

  • bunching the execution of patches together.
  • having more cells per patch (which somewhat decreases the efficiency of the code, because the adaptive resolution of the grid becomes more coarse-grained and a simulation has to run with more cells). Concretely, going from 16^3 patches to 24^3 patches increases the number of physical cells by a factor of (24/16)^3 ≈ 3.4.
  • having more complicated physics. In these tests I am doing pure hydrodynamics; in the real-world code I will run magnetohydrodynamics (MHD), which is ≈5 times more expensive than HD.

Using my mini-app HD test with single precision and a (512)^3 total domain, I have tried five options:

i) Patch size (16)^3, with all 32768 patches calculated by a single kernel. This is the baseline for best performance.
ii) Patch size increased to (40)^3, making the workload ~15 times bigger per patch, with one kernel submitted per patch. This is closer to a real-world setup where patches are calculated dynamically from a task queue.
iii) Same as (ii), but kernels are submitted from 16 different threads using OpenMP.
iv) (16)^3 patches bunched together, 256 patches per bunch, to reduce overhead.
v) (16)^3 patches bunched together, 512 patches per bunch, to reduce overhead further.

With an A6000 card:

i) 0.45 ns/cell
ii) 0.53 ns/cell
iii) 0.58 ns/cell

With an H100 card:

i) 0.19 ns/cell
ii) 0.28 ns/cell
iii) 0.4 ns/cell
iv) 0.25 ns/cell
v) 0.2 ns/cell

In both “per-patch” submission cases the execution is completely asynchronous, so the overhead is entirely due to the extra kernel launches.

A few observations:

  • on an H100 more patches can run concurrently, so the kernel-submission bottleneck is more pronounced.

  • kernel launches from more than one CPU thread add extra overhead when the kernel launch queue is already contested. This does not seem to have improved between the A6000 and the H100.

  • Not reflected in the above numbers, but for tests (iv) and (v) I tried running with 1 CPU thread and with 16 CPU threads and it makes no difference, i.e. the kernel launch is not contested.

I had a small hope that part of the overhead was on the CPU and that launching from multiple threads would reduce it, but it seems to be the opposite.

Ideally, the best strategy is probably to delay the calculation of a patch, store patches in “bunches” of N patches, and trigger the calculation of all patches in a bunch whenever it is full. This entails some extra management overhead for tracking bunches, but if the patch data is already on the GPU it can be accomplished by simply creating a list of pointers or array indices to the patches, as in the sketch below.
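A minimal sketch of that bunching idea, assuming the patch data is already resident on the GPU (kernel name, sizes, and the bunch size are illustrative only):

```cpp
#include <cuda_runtime.h>

constexpr int NTOT = 20, NVAR = 5;      // 16^3 patch plus 2+2 ghost zones per side

// Placeholder bunch kernel: blockIdx.z picks a member of the bunch, and the
// index array maps it to an actual patch already resident on the GPU.
__global__ void patch_update_bunch(float* u, const int* patch_ids)
{
    int patch = patch_ids[blockIdx.z];
    float* up = u + (size_t)patch * NTOT * NTOT * NTOT * NVAR;
    // ... same per-patch update as the single-patch kernel (omitted) ...
    (void)up;
}

// Launch one kernel for a full bunch of 'count' ready patches.
void flush_bunch(float* d_u, const int* d_ids, int count, cudaStream_t stream)
{
    patch_update_bunch<<<dim3(2, 2, count), dim3(8, 8), 0, stream>>>(d_u, d_ids);
}

int main()
{
    const int BUNCH = 256;
    float* d_u;  int* d_ids;
    cudaMalloc(&d_u, sizeof(float) * (size_t)NTOT * NTOT * NTOT * NVAR * BUNCH);
    cudaMalloc(&d_ids, sizeof(int) * BUNCH);

    int h_ids[BUNCH];
    for (int i = 0; i < BUNCH; ++i) h_ids[i] = i;   // here the bunch is simply the first 256 patches
    cudaMemcpy(d_ids, h_ids, sizeof(h_ids), cudaMemcpyHostToDevice);

    flush_bunch(d_u, d_ids, BUNCH, 0);
    cudaDeviceSynchronize();

    cudaFree(d_ids);
    cudaFree(d_u);
    return 0;
}
```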

Best

Troels


It’s good to read that you are gaining valuable insights and making meaningful progress.

That is as I would expect. Generally speaking, every time multiple agents act upon a single resource, synchronization needs to happen somewhere. I don’t know exactly how the test case was set up, but a classical and often superior arrangement for multi-threaded operation is to have one designated thread communicate with the GPU, while all other threads communicate with this designated thread. This keeps the overhead away from the critical shared resource, potentially at the cost of additional host CPU time, while letting the GPU operate as efficiently as possible.
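For illustration, one possible shape of such an arrangement (the kernel, launch configuration, and queue payload are placeholders): worker threads only produce ready patch ids, and a single designated thread owns every CUDA call.

```cpp
#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

__global__ void patch_update(float* u, int patch) { (void)u; (void)patch; }

std::queue<int>         ready;   // patch ids produced by the worker threads
std::mutex              mtx;
std::condition_variable cv;
bool                    done = false;

// The one thread that owns all CUDA calls.
void gpu_thread(float* d_u)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (;;) {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [] { return !ready.empty() || done; });
        if (ready.empty() && done) break;
        int patch = ready.front();
        ready.pop();
        lk.unlock();
        patch_update<<<dim3(2, 2), dim3(8, 8), 0, stream>>>(d_u, patch);
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}

// Worker threads never touch the CUDA runtime; they just hand over patch ids.
void worker(int first, int count)
{
    for (int p = first; p < first + count; ++p) {
        // ... CPU-side pre-update work for patch p ...
        { std::lock_guard<std::mutex> lk(mtx); ready.push(p); }
        cv.notify_one();
    }
}

int main()
{
    float* d_u;
    cudaMalloc(&d_u, 1 << 20);

    std::thread gpu(gpu_thread, d_u);
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) workers.emplace_back(worker, t * 64, 64);
    for (auto& w : workers) w.join();

    { std::lock_guard<std::mutex> lk(mtx); done = true; }
    cv.notify_one();
    gpu.join();

    cudaFree(d_u);
    return 0;
}
```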
