Concurrent Kernel Launching to Hide Kernel Launch Overhead (Not Only Kernel Execution)

Recently I saw a StackOverflow answer (https://stackoverflow.com/a/55898876) that shows a large kernel launch overhead when launching relatively small kernels. Here are the profiling results:

I wonder whether the kernel launch overhead on the CPU thread can be hidden by launching these kernels from different CPU threads (using the same CUDA context, of course).
Does this overhead occur only on the CPU, or is some special component in the GPU hardware also involved?
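
To make the question concrete, here is a minimal sketch of how this could be measured. It is my own hypothetical micro-benchmark, not code from the linked answer: the names `tinyKernel`, `launchMany`, and `timeLaunches` and the launch count are assumptions for illustration. Each host thread launches empty kernels into its own stream, and all threads share the device's primary context, which is the default with the runtime API. Note that the timing includes waiting for the kernels to finish, so it reflects overall launch throughput rather than pure CPU-side cost.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Empty kernel (hypothetical): execution is trivial, so launch cost dominates.
__global__ void tinyKernel() {}

// Launch `launches` tiny kernels into a private stream, then wait for them.
void launchMany(int launches) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int i = 0; i < launches; ++i) {
        tinyKernel<<<1, 1, 0, stream>>>();
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}

// Wall-clock seconds to issue `total` launches split over `numThreads` host threads.
double timeLaunches(int numThreads, int total) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back(launchMany, total / numThreads);
    }
    for (auto& w : workers) {
        w.join();
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    cudaFree(0);               // force context creation before timing
    const int total = 100000;  // assumed launch count, chosen arbitrarily
    for (int n : {1, 2, 4}) {
        double seconds = timeLaunches(n, total);
        std::printf("%d host thread(s): %.3f s for %d launches\n", n, seconds, total);
    }
    return 0;
}
```

If the total time does not drop as host threads are added, that would suggest the launches are being serialized somewhere (in the driver or on the GPU front end) rather than being limited by a single CPU thread. A profiler timeline would be needed to tell those cases apart.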