Tesla C2050 and Tesla M1060

I am trying to time kernel launch overheads on both Fermi-C2050 and M1060. This is the code I am using:

   __global__ void emptyKernel() { }   // empty kernel, launched with 1 thread

   for (int xv = 0; xv < 100000; xv++) {
       cudaEventRecord(start, 0);
       emptyKernel<<<1, 1>>>();
       cudaEventRecord(stop, 0);
       cudaEventSynchronize(stop);     // wait until the stop event has been recorded
       cudaEventElapsedTime(&x_time, start, stop);
       time += x_time;
   }
   time /= 100000;                     // average per launch, in milliseconds

   printf("Time is %0.10f\n", 1000.0 * time);   // print in microseconds

And the timings I get are about 6.5us on both systems. Shouldn't they be different, in that Fermi should be faster? Am I doing anything wrong?


Since you launch an empty kernel with only 1 thread, you are measuring only part of the kernel launch overhead (which increases with the number of threads/blocks).

That time includes, for example, PCIe latencies and so on. I wouldn't necessarily expect the results to differ between a C2050 and an M1060.



I am trying to measure the improvements made in the Fermi architecture. NVIDIA states on their website that Fermi has reduced kernel launch overhead and improved context switching (10 to 20 times faster). I am trying to verify that. Any ideas on how to do that using timing operations?

Thank you,

My vague recollection from when I was using a C1060 equivalent (the 8800 GTX) was that kernel launch times used to be tens of microseconds. The speed you are getting for the Fermi device seems to be fairly typical, so the real mystery is why your C1060 seems to be so fast.

If you are curious to understand the difference, I would try benchmarking your empty kernel with a launch configuration like <<<256, 256>>>. It might be that in this more realistic scenario, the C1060's launch time grows faster.
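The suggestion above could be sketched as follows. All names here are illustrative (the timeLaunchUs helper is hypothetical, not from the original code); the cudaEventSynchronize call matters, since the stop event must have completed before its timestamp is read:

```cuda
#include <cstdio>

__global__ void emptyKernel() { }

// Average launch-to-completion time for an empty kernel, in microseconds,
// for the given grid/block configuration.
static float timeLaunchUs(int gridDim, int blockDim, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float totalMs = 0.0f;
    for (int i = 0; i < iters; ++i) {
        cudaEventRecord(start, 0);
        emptyKernel<<<gridDim, blockDim>>>();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);        // wait for the stop event to complete
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        totalMs += ms;
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 1000.0f * totalMs / iters;      // ms -> us
}

int main()
{
    printf("<<<1,1>>>     : %.2f us\n", timeLaunchUs(1, 1, 10000));
    printf("<<<256,256>>> : %.2f us\n", timeLaunchUs(256, 256, 10000));
    return 0;
}
```

Comparing the two printed numbers on each card should show how launch time scales with the amount of work being dispatched, which is where an architectural difference is more likely to appear.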

The C1060 was a GT200-based design, so performance should be analogous to the GTX 280. But leaving that aside, even if Fermi turns out to be a lot faster than a GT200 with respect to kernel launch latency, there are so many host operating system and hardware characteristics that could easily mask performance differences between the cards. Your 6.5us number could easily be dominated by operating-system signal/interrupt, driver, and PCI Express bus latency rather than by any characteristic of the card itself. Reducing the latency of something that is not close to the latency bottleneck of the total system won't have much effect on the overall latency you measure.
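One way to gauge how much of that 6.5us is host-side API, driver, and bus cost rather than anything the GPU does is to time the launches from the CPU side instead of with CUDA events. A minimal sketch using std::chrono (the warm-up launch and iteration count are arbitrary choices, not from the original code):

```cuda
#include <chrono>
#include <cstdio>

__global__ void emptyKernel() { }

int main()
{
    const int iters = 100000;
    emptyKernel<<<1, 1>>>();               // warm-up: the first launch pays extra init cost
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) {
        emptyKernel<<<1, 1>>>();           // host time to issue the launch...
        cudaDeviceSynchronize();           // ...plus time until the GPU completes it
    }
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("host-measured launch+sync: %.2f us\n", us);
    return 0;
}
```

If this host-side figure is much larger than the event-based one, most of what you are measuring lives in the host software stack and the bus, not on the card.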

Thank you all,

I have increased the loop size to 1,000,000 and modified the kernel launch to use 512 threads.


I re-timed the code, and this time tested a GeForce 8400GS as well.

C2050 -> 5.7us
M1060 -> 7.9us
8400GS -> 84.7us

Oops, right. I confused the C1060 with the C870.