If using Tesla GPUs with the TCC driver on Windows 7/8 is not an option, you may want to consider using Linux. I have an older dual-boot Linux / Windows XP machine, which is how I know that the Windows XP driver and the Linux driver have the same low launch overhead. That said, I would not expect Windows XP to be a viable option for most users at this time.
My understanding is that the launch batching performed by the CUDA driver under WDDM significantly reduces the average launch overhead, but that kicking off each batch is expensive. I do not know what the average cost is, so I can't say whether the 139 usec reported above reflects the cost of kicking off a batch of launches or the average cost of a single launch. It might be worth an experiment to establish the maximum kernel launch rate (e.g. with empty kernels) when using the WDDM driver; a sketch of such a measurement follows below.
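In case someone wants to try this, here is a minimal sketch of what I have in mind (the kernel name and the iteration count are just placeholders, not taken from the measurements above). All launches are issued back-to-back and the host synchronizes only once at the end, so the per-iteration time approximates the asynchronous launch cost and lets the WDDM driver batch the launches:

```
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main() {
    const int N = 500;

    emptyKernel<<<1, 1>>>();   // warm-up launch; triggers context creation
    cudaDeviceSynchronize();   // (cudaThreadSynchronize() on older toolkits)

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i) {
        emptyKernel<<<1, 1>>>();
    }
    cudaDeviceSynchronize();   // drain the launch queue once at the end
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("average launch overhead: %.2f usec\n", us / N);
    return 0;
}
```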
I ran some quick experiments on a reasonably high-end Windows 7 machine, with a 3.5 GHz Sandy Bridge CPU and a Tesla C2050, using the latest drivers. Averaging across 500 kernel launches, I see a kernel launch overhead of 3-10 usec per launch. If I add a call to cudaThreadSynchronize() after each kernel call, it takes 50-100 usec per launch on average.
By comparison, I measured the average kernel launch overhead to be 3.7 usec under RHEL Linux, and 16 usec with cudaThreadSynchronize() added. Under Windows XP64 the numbers are 3.5 usec and 14 usec. These are, as stated, averages; I did not have time to look at the distribution. I would expect fairly wide variance between individual launches, so when assessing launch overhead it is best to look at a larger collection of kernel launches rather than an individual one.
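For the synchronized numbers, the only change needed in the sketch above is to wait after every launch, so each iteration pays the full submit-plus-wait cost instead of the amortized batched cost:

```
    for (int i = 0; i < N; ++i) {
        emptyKernel<<<1, 1>>>();
        cudaDeviceSynchronize();   // cudaThreadSynchronize() in older CUDA versions
    }
```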
“Your mileage may vary”, that is, you may see different overhead on your system depending on the performance of the CPU, the GPU(s) in the system, the driver version, and probably a bunch of other factors.
It seems that synchronization adds a relatively large amount of overhead with the WDDM driver model, so in practical terms one would want to minimize the number of synchronous / synchronizing API calls in a CUDA application. I would consider that a “best practice” anyhow; a rough illustration follows below.
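To illustrate (the kernel, buffer size, and launch configuration here are hypothetical), the idea is to queue all work asynchronously into a stream and synchronize once at the end, rather than after every call:

```
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_buf, *d_buf;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocHost(&h_buf, bytes);   // pinned host memory, needed for truly async copies
    cudaMalloc(&d_buf, bytes);
    for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;

    // All work is queued asynchronously in one stream; the host blocks
    // only once, at the end, instead of after every API call.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);   // single synchronization point

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    cudaStreamDestroy(stream);
    return 0;
}
```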