On single-GPU systems under 64-bit Linux I typically see a launch overhead of 5 us or less for empty kernels (i.e. no code and no kernel arguments). It seems to vary a bit with GPU type, CUDA version, and host system; the lowest times I have observed were about 3 us.
From that perspective the reported launch overheads seem larger than expected. However, I have not run with a K10, measured launch overhead when more than one GPU is present in a system, or checked exactly how NVVP reports launch overhead. I have always used my own simple app for measuring launch overhead, timing batches of back-to-back kernel launches. I wouldn’t be surprised if NVVP needs to add instrumentation to launches, which would in turn increase the launch overhead, so you may want to run your own test.
I have never seen the compiler optimize out empty kernel launches, and I am not sure it would be a good idea: for one thing, it would eliminate a convenient way of measuring the minimum launch overhead :-)
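For anyone who wants to try the batched back-to-back launch approach mentioned above, here is a minimal sketch. It is not the app referred to in this post, just one plausible way to write such a test. The kernel name, the warm-up count, and the batch size are all assumptions chosen for illustration. It times the whole batch with a host-side clock and divides by the number of launches, so what it reports is the average cost per launch in a saturated launch queue, not the latency of one isolated launch.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel: no code and no kernel arguments.
__global__ void emptyKernel() {}

int main() {
    // Illustrative values (assumptions); adjust for your system.
    const int kWarmupLaunches = 100;
    const int kTimedLaunches  = 100000;

    // Warm up so that context creation and module loading are not timed.
    for (int i = 0; i < kWarmupLaunches; ++i) {
        emptyKernel<<<1, 1>>>();
    }
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "warm-up failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Time one batch of back-to-back launches, including the final sync.
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < kTimedLaunches; ++i) {
        emptyKernel<<<1, 1>>>();
    }
    err = cudaDeviceSynchronize();
    auto stop = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "timed batch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    double totalUs =
        std::chrono::duration<double, std::micro>(stop - start).count();
    std::printf("average time per launch: %.2f us over %d launches\n",
                totalUs / kTimedLaunches, kTimedLaunches);
    return 0;
}
```

Build with something like nvcc -O2 on the file and run it outside the profiler, so that any instrumentation a profiler might add does not affect the numbers. Running it several times and taking the lowest result gives a reasonable estimate of the minimum.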