Since you launch an empty kernel with only one thread, you are measuring only part of the kernel launch overhead (which increases with the number of threads/blocks). That time also includes PCIe latencies and so on, so I wouldn't necessarily expect the results to differ much between a C2050 and an M1060.
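For reference, here is a minimal sketch of the kind of measurement being discussed (my own illustration, not the original poster's code): launch an empty kernel with a single thread many times and average, using CUDA events for the timing.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main()
{
    const int N = 10000;

    // Warm up so one-time context/driver setup is not counted.
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < N; ++i)
        empty_kernel<<<1, 1>>>();          // back-to-back asynchronous launches
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average launch time: %.2f us\n", ms * 1000.0f / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

Note that timing back-to-back asynchronous launches like this measures launch throughput; putting a cudaDeviceSynchronize() inside the loop would instead give the full end-to-end latency per launch, which also includes the driver and PCIe round trip mentioned above.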
I am trying to measure the improvements made in the Fermi architecture. NVIDIA published on their website that Fermi has reduced kernel launch overhead and improved context switching (10 to 20 times faster). I am trying to verify that. Any ideas on how to do that using timing operations?
My vague recollection from when I was using a C1060 equivalent (the 8800 GTX) was that kernel launch times used to be tens of microseconds. The speed you are getting for the Fermi device seems to be fairly typical, so the real mystery is why your C1060 seems to be so fast.
If you are curious to understand the difference, I would try benchmarking your empty kernel with a launch configuration like <<<256, 256>>>. It might be that in this more realistic scenario the C1060's launch time grows faster.
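Something along these lines would do it (a sketch only; the empty kernel and the time_launch_us helper are mine, and it assumes host-side wall-clock timing of launch plus synchronize):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

// Average end-to-end time (launch + completion) for a given launch configuration.
static double time_launch_us(dim3 grid, dim3 block, int iters)
{
    empty_kernel<<<grid, block>>>();   // warm-up launch, excluded from timing
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) {
        empty_kernel<<<grid, block>>>();
        cudaDeviceSynchronize();       // wait so the full round trip is included
    }
    auto t1 = std::chrono::high_resolution_clock::now();

    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}

int main()
{
    printf("<<<1, 1>>>     : %.2f us per launch\n", time_launch_us(dim3(1),   dim3(1),   1000));
    printf("<<<256, 256>>> : %.2f us per launch\n", time_launch_us(dim3(256), dim3(256), 1000));
    return 0;
}

Running that on both cards would show whether the gap persists once the launch actually has to set up a full grid of blocks.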
The C1060 was a GT200-based design, so performance should be analogous to the GTX 280. But leaving that aside: even if Fermi turns out to be a lot faster than GT200 with respect to kernel launch latency, there are so many host operating system and hardware characteristics involved that they could easily mask performance differences between the cards. Your 6.5 µs number could easily be dominated by operating system signal/interrupt handling, driver overhead, and PCI Express bus latency rather than by any characteristic of the card itself. Reducing the latency of a component that is nowhere near the latency bottleneck of the total system won't have much effect on the overall latency you measure.