(1) What’s the operating system?
(2) What’s the GPU?
(3) If Windows, WDDM driver or TCC driver?
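If it helps with answering (2) and (3), here is a minimal sketch that asks the CUDA runtime for the device name and driver model. The helper name `print_devices` is made up for illustration; `cudaGetDeviceCount`, `cudaGetDeviceProperties`, and the `tccDriver` field of `cudaDeviceProp` are part of the runtime API. On Linux `tccDriver` reads 0, because the WDDM/TCC distinction only exists on Windows.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints name, compute capability, and driver model
// for each visible GPU. Returns the number of devices printed (0 on error).
int print_devices() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 0;
    }
    int printed = 0;
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        // tccDriver is 1 only for a GPU running under the TCC driver on Windows.
        printf("Device %d: %s, compute capability %d.%d, TCC driver: %s\n",
               dev, prop.name, prop.major, prop.minor,
               prop.tccDriver ? "yes" : "no");
        ++printed;
    }
    return printed;
}

int main() {
    print_devices();
    return 0;
}
```

Build with something like `nvcc devices.cu -o devices` and post the output together with the operating system.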
The minimum overhead of a kernel launch is around 5 usec, which is achievable on Linux and on Windows with the TCC driver. With Windows and the WDDM driver (the default), launches are batched to mitigate the massive overhead imposed by the WDDM driver model. That batching causes significant fluctuations in overhead and can drive the overhead of some launches to 50 usec or so.
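Launch times of 58 milliseconds would seem to indicate a saturation of the launch queue (which is quite deep): once the queue is full, each new launch blocks until an earlier kernel has finished, so the measured launch time starts to include kernel execution time. At least that’s the best idea I have right now, without access to your data.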
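To see what the host-side launch cost looks like on a given system, something along these lines could be used. It is only a sketch under my own assumptions: the names `empty_kernel`, `LaunchStats`, and `time_launches` are invented for illustration, and it times the launch call with `std::chrono::steady_clock` on the host. Because the kernel is empty, the queue should not fill up; with long-running kernels the same loop would eventually block in the launch call.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel: whatever time is measured is launch overhead, not GPU work.
__global__ void empty_kernel() {}

struct LaunchStats { double min_us, avg_us, max_us; };

// Hypothetical helper: measures how long each launch call takes on the host.
LaunchStats time_launches(int n) {
    using clock = std::chrono::steady_clock;
    empty_kernel<<<1, 1>>>();      // warm-up absorbs one-time context setup
    cudaDeviceSynchronize();

    LaunchStats s{1e30, 0.0, 0.0};
    for (int i = 0; i < n; ++i) {
        auto t0 = clock::now();
        empty_kernel<<<1, 1>>>();  // asynchronous: returns once the launch is queued
        auto t1 = clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        if (us < s.min_us) s.min_us = us;
        if (us > s.max_us) s.max_us = us;
        s.avg_us += us;
    }
    cudaDeviceSynchronize();       // drain the queue before returning
    s.avg_us /= n;
    return s;
}

int main() {
    LaunchStats s = time_launches(1000);
    printf("min %.2f usec, avg %.2f usec, max %.2f usec\n",
           s.min_us, s.avg_us, s.max_us);
    return 0;
}
```

A large gap between the minimum and the maximum would be consistent with the WDDM batching described above. Replacing the empty kernel with your real one and watching when the per-launch time jumps would be one way to check the queue-saturation theory.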
To minimize launch overhead:
(1) Use Linux, or Windows with TCC driver (only some GPUs are supported!)
(2) Use a CPU with high single-thread performance, as the software portion of the launch overhead is serial CPU work. CPUs with base frequency >= 3.5 GHz will work well.