I have a CFD application that we are trying to accelerate on GPUs. I have written a solver that performs well: 0.2 ns per cell update on an H100 for a full 3D hydro finite-volume Riemann solver.
When I run a synthetic problem on a 512^3 unigrid I can fill up the GPU and get excellent performance of 0.2 ns per cell update. The solver uses a marching-plane approach, and the domain is tiled into 8x8 tiles of arbitrary height, which makes it possible to keep all scratch variables in the shared memory of the SM.
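In case it helps to see what I mean by the marching-plane layout, here is a stripped-down sketch (not the real solver; the kernel name, the placeholder update, and the halo-free scratch array are just for illustration): one thread block per 8x8 tile, marching in z through the column, with per-plane scratch staged in shared memory.

```
#include <cuda_runtime.h>

#define TILE 8

__global__ void march_plane(const float* __restrict__ in,
                            float* __restrict__ out,
                            int nx, int ny, int nz)
{
    // Per-plane scratch in shared memory; halo/ghost handling omitted for brevity.
    __shared__ float scratch[TILE][TILE];

    const int i = blockIdx.x * TILE + threadIdx.x;  // global x
    const int j = blockIdx.y * TILE + threadIdx.y;  // global y

    // March in z through the whole column handled by this 8x8 tile.
    for (int k = 0; k < nz; ++k) {
        if (i < nx && j < ny) {
            const size_t idx = (size_t)k * ny * nx + (size_t)j * nx + i;
            scratch[threadIdx.y][threadIdx.x] = in[idx];   // stage the plane
        }
        __syncthreads();
        if (i < nx && j < ny) {
            const size_t idx = (size_t)k * ny * nx + (size_t)j * nx + i;
            // Placeholder update; the real solver does flux/Riemann work here.
            out[idx] = scratch[threadIdx.y][threadIdx.x];
        }
        __syncthreads();
    }
}

// Launch: dim3 block(TILE, TILE); dim3 grid(nx / TILE, ny / TILE);
```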
When I cut the grid into 32^3 patches of 16^3 cells each, so that the memory is organised as (N,N,N,Nvar,Npatch) with N = 16+4 (4 for 2+2 boundary zones) and Npatch = 32^3, and I launch it as a single kernel with one of the block dims being Npatch, I again get the same performance.
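For the patched case, the layout and the single batched launch look roughly like this (a sketch only, assuming a Fortran-style layout with i fastest and the patch index slowest, and assuming the patch index is carried in one of the grid dimensions; update_all_patches, idx5, and the NVAR value are illustrative):

```
#include <cuda_runtime.h>

#define N      (16 + 4)      // 16 interior cells + 2+2 ghost zones
#define NVAR   5             // number of hydro variables (example value)
#define NPATCH (32*32*32)

// Flat index into (i,j,k,var,patch) with i fastest and patch slowest.
__device__ __forceinline__ size_t idx5(int i, int j, int k, int var, int patch)
{
    return (size_t)i +
           (size_t)N * (j + (size_t)N * (k + (size_t)N * (var + (size_t)NVAR * patch)));
}

__global__ void update_all_patches(const float* __restrict__ u,
                                   float* __restrict__ unew)
{
    const int patch = blockIdx.z;      // patch index carried in the grid
    const int k     = blockIdx.y;      // plane inside the patch
    const int i     = threadIdx.x;
    const int j     = threadIdx.y;

    for (int var = 0; var < NVAR; ++var) {
        const size_t p = idx5(i, j, k, var, patch);
        unew[p] = u[p];                // placeholder for the real update
    }
}

// One 20x20 thread block per plane, N planes in grid.y, all 32^3 patches in grid.z:
//   dim3 block(N, N);
//   dim3 grid(1, N, NPATCH);
//   update_all_patches<<<grid, block>>>(d_u, d_unew);
```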
The problem arises when I try to launch one kernel per patch. Even when I multiplex the launches over 128 streams and make them all asynchronous, performance drops by more than a factor of 10, to about 3 ns per cell update.
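For reference, the per-patch variant is essentially the following loop (again a sketch; update_one_patch, launch_per_patch, and patch_stride stand in for the real per-patch kernel and bookkeeping):

```
#include <cuda_runtime.h>

#define NB       20          // 16 interior cells + 2+2 ghost zones
#define NSTREAMS 128

__global__ void update_one_patch(const float* __restrict__ u,
                                 float* __restrict__ unew)
{
    const int k = blockIdx.y, i = threadIdx.x, j = threadIdx.y;
    const size_t p = (size_t)k * NB * NB + (size_t)j * NB + i;
    unew[p] = u[p];                          // placeholder update
}

void launch_per_patch(const float* d_u, float* d_unew,
                      int npatch, size_t patch_stride)
{
    static cudaStream_t streams[NSTREAMS];
    static bool initialised = false;
    if (!initialised) {
        for (int s = 0; s < NSTREAMS; ++s)
            cudaStreamCreateWithFlags(&streams[s], cudaStreamNonBlocking);
        initialised = true;
    }

    dim3 block(NB, NB);                      // one 2D thread block per plane
    dim3 grid(1, NB, 1);

    // One small kernel per patch, round-robined over the streams so the
    // launches are asynchronous and can in principle overlap on the device.
    for (int p = 0; p < npatch; ++p) {
        cudaStream_t s = streams[p % NSTREAMS];
        update_one_patch<<<grid, block, 0, s>>>(d_u + p * patch_stride,
                                                d_unew + p * patch_stride);
    }
    cudaDeviceSynchronize();                 // drain all patches for this step
}
```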
My estimate is that a single 16^3 patch executes on about 4 SMs and takes roughly 25 microseconds to compute. Is the cost of launching a kernel really 10 times that (i.e. ~250 microseconds) when done completely asynchronously, and is it possible to reduce it?
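For what it's worth, this is the kind of generic micro-benchmark I would use to put a number on the raw launch cost on a given machine (empty kernels, asynchronous submission; not my solver code), to compare against the per-patch numbers above:

```
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void empty_kernel() {}

int main()
{
    const int nlaunch = 10000;
    empty_kernel<<<1, 32>>>();               // warm-up / context creation
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < nlaunch; ++i)
        empty_kernel<<<1, 32>>>();           // asynchronous submission only
    auto t1 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();                 // wait for the device to drain
    auto t2 = std::chrono::steady_clock::now();

    const double submit_us =
        std::chrono::duration<double, std::micro>(t1 - t0).count() / nlaunch;
    const double end_to_end_us =
        std::chrono::duration<double, std::micro>(t2 - t0).count() / nlaunch;
    printf("avg submit: %.2f us/launch, avg end-to-end: %.2f us/launch\n",
           submit_us, end_to_end_us);
    return 0;
}
```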
The backdrop is that in the production application we have an adaptive mesh of patches that is executed fully multithreaded, with the possibility of detaching CPU threads, so the natural workload size is a single patch of 16^3 elements, and a few CPU threads can launch thousands of patches per second.