Why does a CUDA kernel launch take so much time?

What causes kernel launch latency to be so high? The maximum value can be more than 50ms.

In some cases both the CPU and the GPU are mostly idle, yet cudaLaunchKernel still takes a long time.

A truly idle GPU and system should not experience a kernel launch latency of 50ms, and probably not even 50us.

For best-case performance, the GPU must be idle. This means that the GPU is not supporting a display and has no other workloads running on it. Any other workload, or display support, can introduce more-or-less unbounded kernel launch latency. Since the kernel launch process begins on the CPU via a library call, it is also important that your CPU has enough idle capacity to get through the launch call quickly. Other applications that keep the CPU heavily loaded can increase latency.
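One way to check what the launch call costs on your own machine is to time it from the host with an empty kernel. The following is only a minimal sketch: `noop_kernel`, `measure_launch_call_us`, and the `iterations` parameter are hypothetical names chosen for illustration, and error checking is left out for brevity.

```cuda
#include <chrono>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical empty kernel: it does no work, so the timings reflect launch overhead only.
__global__ void noop_kernel() {}

// Returns the host-side duration of each launch call, in microseconds.
// `iterations` is an illustrative parameter, not something taken from the question.
std::vector<double> measure_launch_call_us(int iterations) {
    std::vector<double> samples;
    samples.reserve(iterations);

    // Warm-up: the first launch also pays for context creation and module loading.
    noop_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    for (int i = 0; i < iterations; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        noop_kernel<<<1, 1>>>();   // asynchronous: returns once the launch has been queued
        auto t1 = std::chrono::steady_clock::now();
        cudaDeviceSynchronize();   // drain the queue so launches do not pile up
        samples.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    return samples;
}
```

Calling this with a few thousand iterations and looking at the maximum rather than the average should show whether the outliers are still there on an otherwise idle system. If they are, the cause is more likely the platform (display, driver model, CPU scheduling) than your application.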

For example, if a kernel that fully occupies the GPU is already running, then any subsequently launched kernel cannot begin executing until that kernel finishes.
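The effect of a busy GPU can be reproduced with a small experiment. This sketch is an assumption-laden illustration, not a description of your code: `spin_kernel`, `tiny_kernel`, and `launch_to_completion_ms` are hypothetical names, and both kernels are issued into the default stream, where serialization is guaranteed. Across separate streams the same delay appears when the first kernel leaves no free resources on the GPU.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical kernel that keeps the GPU busy for roughly `cycles` clock ticks.
__global__ void spin_kernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Hypothetical empty kernel whose start we want to observe.
__global__ void tiny_kernel() {}

// Time from issuing tiny_kernel until it has completed, in milliseconds.
// If `gpu_busy` is true, spin_kernel is queued ahead of it in the same stream.
double launch_to_completion_ms(bool gpu_busy, long long spin_cycles) {
    // Warm-up so context creation and kernel loading are not part of the measurement.
    tiny_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    if (gpu_busy) {
        spin_kernel<<<1, 1>>>(spin_cycles);   // returns immediately, runs on the GPU
    }

    auto t0 = std::chrono::steady_clock::now();
    tiny_kernel<<<1, 1>>>();                  // must wait behind spin_kernel, if present
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Comparing `launch_to_completion_ms(false, 0)` with `launch_to_completion_ms(true, 100000000LL)` should show the second value growing with the spin time, even though the kernel being measured does nothing.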

Managed memory on Windows (or on Linux with a pre-Pascal GPU) can also impact kernel launch latency, because the kernel launch triggers migration of data before the kernel can actually begin executing.
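If you are unsure which managed memory regime your system is in, the device attribute `cudaDevAttrConcurrentManagedAccess` is a reasonable indicator. The sketch below assumes that a value of 0 corresponds to the launch-time migration behavior described above; the helper name `concurrent_managed_access` is hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns 1 if the device supports concurrent managed access (managed pages can
// migrate on demand), 0 if it does not (managed data is expected to be migrated
// around kernel launch), and -1 if the query itself fails.
int concurrent_managed_access(int device) {
    int value = 0;
    cudaError_t err =
        cudaDeviceGetAttribute(&value, cudaDevAttrConcurrentManagedAccess, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "attribute query failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return value;
}
```

If this returns 0 and your code uses cudaMallocManaged, it may be worth comparing against a version that uses cudaMalloc with explicit cudaMemcpy calls, so that no data transfer is folded into the launch.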

Explaining what might be happening in your case would require more details, approximately a full test case: the hardware and software platform you are running on, as well as complete test code.

Alternatively, a profiler such as Nsight Systems will likely yield useful information.