Kernel launch times of around 5 µs for null kernels (kernels that do nothing) on PCIe gen3 hardware have been typical for the past decade. It is my understanding that this is primarily a function of various hardware latencies associated with PCIe, with some minor impact from CPU performance in the form of CUDA driver overhead. There is some additional overhead on Windows with the WDDM driver, where the CUDA driver batches launches. Are you on Linux or Windows?
Assuming this is Linux: How confident are you about the measurement methodology that found a minimum launch time of 770 ns? I find that number to be improbably low, but I do not have access to a system with the latest PCIe gen4 hardware, so maybe this is a thing now. If you are on Windows with a WDDM driver, the short and long launch times observed are likely artifacts caused by batching. Consider switching to the TCC driver if possible.
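In case it helps to cross-check the methodology, here is a minimal sketch of how I would time null kernel launches. It is not a definitive benchmark. The names `null_kernel` and `per_launch_us` and the iteration counts are my own choices for illustration. Kernel launches are asynchronous, so the sketch reports two different numbers: the average time per launch when launches are issued back to back, and the average time for a launch followed by a synchronization. The first can come out much lower than the second, so it matters which of the two a measurement actually captures.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Null kernel: does no work, so the measured time is launch overhead only.
__global__ void null_kernel() {}

// Convert a total elapsed time in milliseconds to microseconds per launch.
double per_launch_us(double total_ms, int launches) {
    return total_ms * 1000.0 / launches;
}

// Elapsed host time in milliseconds between two steady_clock time points.
static double elapsed_ms(std::chrono::steady_clock::time_point t0,
                         std::chrono::steady_clock::time_point t1) {
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int warmup = 100;      // assumption: enough to absorb one-time init cost
    const int launches = 100000; // assumption: enough to average out timer noise

    // Warm-up launches so context creation and module load are not timed.
    for (int i = 0; i < warmup; ++i) null_kernel<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // (1) Back-to-back asynchronous launches, one synchronization at the end.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) null_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    printf("back-to-back launches: %.3f us per launch\n",
           per_launch_us(elapsed_ms(t0, t1), launches));

    // (2) Launch followed by synchronization each time (round trip per launch).
    t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        null_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();
    }
    t1 = std::chrono::steady_clock::now();
    printf("launch + synchronize:  %.3f us per launch\n",
           per_launch_us(elapsed_ms(t0, t1), launches));

    return 0;
}
```

This should build with something like `nvcc -O2 launch_overhead.cu` (the file name is arbitrary). Taking the minimum over individual launches instead of an average over many is more sensitive to timer resolution and, under WDDM, to batching.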
Because of the kernel launch overhead it is, generally speaking, not a good idea to use extremely short-running kernels. A kernel runtime of 10 ms on the fastest GPU models of a GPU generation (which then corresponds to around 100 ms on the slowest GPUs of that generation) seems like a good target, one that practically eliminates any impact from kernel launch overhead: with a launch overhead of around 5 µs, a 10 ms kernel loses only about 0.05% of its time to the launch.