I do not use Windows 10, but I seem to recall reading elsewhere that the launch-overhead problem has gotten slightly worse with the WDDM 2.0 driver model used by Windows 10. If your GPU allows you to use the TCC driver (I am not sure whether that is supported for OpenCL!), I would suggest using it, as it eliminates the WDDM overhead issues.
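If memory serves, the driver model can be switched with nvidia-smi (this requires administrator rights, and TCC is available only on Tesla- and most Quadro-class GPUs, not GeForce); something along these lines:

```
nvidia-smi -i 0 -dm TCC
```

A reboot may be required for the change to take effect.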
I do not know how closely Direct3D’s command-issue mechanism is related to CUDA’s or OpenCL’s kernel-launch mechanism; beyond the general technique of placing commands and data in a push buffer, they may not have much in common.
As I said, the launch batching used by the CUDA driver (and, by extension, probably OpenCL) to lower the average launch overhead under WDDM can lead to spikes in the latency of a particular kernel launch. Depending on your timing methodology, you may be picking up such spikes. To get a better idea, you may want to do the following (a minimal sketch follows the list):
(1) use a timer with at least microsecond resolution
(2) do a warmup run before starting to measure (standard procedure for all benchmarks)
(3) use null kernels (empty kernels that do no work), so that only the launch overhead is measured
(4) issue thousands of kernels back to back, tracking maximum, minimum, and average launch overhead
You may well find that the minimum overhead (e.g. 10-20 microseconds) is much closer to the “ideal” 5-7 microsecond range I stated than what you are currently measuring.
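Here is a minimal sketch of that measurement procedure in CUDA C++ (the same idea carries over to OpenCL); the kernel name, launch count, and output format are mine, purely for illustration:

```
#include <cstdio>
#include <cfloat>
#include <chrono>
#include <cuda_runtime.h>

// null kernel: does no work, so only the launch overhead is measured
__global__ void null_kernel(void) { }

int main(void)
{
    const int NUM_LAUNCHES = 10000;  // arbitrary; thousands of back-to-back launches
    double min_us = DBL_MAX, max_us = 0.0, sum_us = 0.0;

    // warmup run: absorbs one-time context/module initialization costs
    null_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    for (int i = 0; i < NUM_LAUNCHES; i++) {
        auto start = std::chrono::steady_clock::now();  // sub-microsecond resolution
        null_kernel<<<1, 1>>>();                        // asynchronous submission
        auto stop = std::chrono::steady_clock::now();
        double us = std::chrono::duration<double, std::micro>(stop - start).count();
        if (us < min_us) min_us = us;
        if (us > max_us) max_us = us;
        sum_us += us;
    }
    cudaDeviceSynchronize();  // drain the queue before reporting

    printf("launch overhead: min = %.1f us, max = %.1f us, avg = %.1f us\n",
           min_us, max_us, sum_us / NUM_LAUNCHES);
    return 0;
}
```

Timing just the submission like this should make the WDDM batching visible: most launches return quickly, while the occasional launch that triggers a flush of the batched push buffer shows up as a spike in the maximum.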
In general, launch overhead is not something you can do much about as a programmer. NVIDIA is well aware of the issue and tries to address it (e.g. through the batching already mentioned). The CUBLAS and CUFFT libraries (and possibly others) shipping with CUDA provide batch interfaces to support work on large-ish sets of small data items, which amortizes the launch overhead across the whole batch.
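For example, here is a minimal sketch of using CUBLAS’s cublasSgemmBatched to multiply many small matrices with a single library call rather than one launch per matrix; the batch size and matrix dimension are arbitrary illustration values, and error checking is omitted for brevity (compile with -lcublas):

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int BATCH = 1000;  // number of small matrices (illustration value)
    const int N = 8;         // each matrix is N x N (illustration value)
    const float alpha = 1.0f, beta = 0.0f;

    // one contiguous allocation per operand, sliced into BATCH matrices
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * BATCH * N * N);
    cudaMalloc(&dB, sizeof(float) * BATCH * N * N);
    cudaMalloc(&dC, sizeof(float) * BATCH * N * N);

    // the batched API takes device arrays of per-matrix device pointers
    float **hA = (float**)malloc(sizeof(float*) * BATCH);
    float **hB = (float**)malloc(sizeof(float*) * BATCH);
    float **hC = (float**)malloc(sizeof(float*) * BATCH);
    for (int i = 0; i < BATCH; i++) {
        hA[i] = dA + i * N * N;
        hB[i] = dB + i * N * N;
        hC[i] = dC + i * N * N;
    }
    float **dAarr, **dBarr, **dCarr;
    cudaMalloc(&dAarr, sizeof(float*) * BATCH);
    cudaMalloc(&dBarr, sizeof(float*) * BATCH);
    cudaMalloc(&dCarr, sizeof(float*) * BATCH);
    cudaMemcpy(dAarr, hA, sizeof(float*) * BATCH, cudaMemcpyHostToDevice);
    cudaMemcpy(dBarr, hB, sizeof(float*) * BATCH, cudaMemcpyHostToDevice);
    cudaMemcpy(dCarr, hC, sizeof(float*) * BATCH, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // one call performs all BATCH multiplications: C[i] = A[i] * B[i]
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                       &alpha, (const float**)dAarr, N,
                               (const float**)dBarr, N,
                       &beta,  dCarr, N, BATCH);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dAarr); cudaFree(dBarr); cudaFree(dCarr);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    printf("batched SGEMM of %d %dx%d matrices issued with one call\n", BATCH, N, N);
    return 0;
}
```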
Generally speaking, if you have a fast CPU, properly vectorized and threaded CPU code, and the active data set fits into the CPU’s last-level cache, it is often not worthwhile to attempt GPU processing. Likewise, if you need low-latency (rather than high-throughput) processing, e.g. for high-frequency trading, doing the processing on the CPU may well be the best solution. GPUs are great for particular kinds of processing, but they don’t make CPUs obsolete. Instead, hybrid processing allows programmers to harness the strengths of each (GPU and CPU), which is why I usually recommend pairing high-end GPUs with high-frequency (>= 3.5 GHz) quad-core or hexa-core CPUs.