A frustrating question! Call for help with the CUDA timer

The following is part of the transpose code in the CUDA SDK example.

// warmup so we don't time CUDA startup
transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);

int numIterations = 1;

printf("Transposing a %d by %d matrix of floats...\n", size_x, size_y);

// execute the kernel
cutStartTimer(timer);
for (int i = 0; i < numIterations; ++i)
{
    transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);                   //first is naive transpose
}
cudaThreadSynchronize();
cutStopTimer(timer);
float naiveTime = cutGetTimerValue(timer);

// execute the kernel

cutResetTimer(timer);
cutStartTimer(timer);
for (int i = 0; i < numIterations; ++i)
{
    transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);                             //second is optimized transpose
}
cudaThreadSynchronize();
cutStopTimer(timer);
float optimizedTime = cutGetTimerValue(timer);

printf("Naive transpose average time:     %0.3f ms\n", naiveTime / numIterations);

printf("Optimized transpose average time: %0.3f ms\n\n", optimizedTime / numIterations);

I ran the code and timed the naive transpose and the optimized transpose.
The naiveTime is about 17.6 ms, and the optimizedTime is about 0.581 ms.

Then I changed the code:

// warmup so we don't time CUDA startup
transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y); // I changed the order

int numIterations = 1;

printf("Transposing a %d by %d matrix of floats...\n", size_x, size_y);

// execute the kernel
cutStartTimer(timer);
for (int i = 0; i < numIterations; ++i)
{
    transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);                         //first is optimized transpose
}
cudaThreadSynchronize();
cutStopTimer(timer);
float optimizedTime = cutGetTimerValue(timer);

// execute the kernel

cutResetTimer(timer);
cutStartTimer(timer);
for (int i = 0; i < numIterations; ++i)
{
    transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);                // second is naive transpose
}
cudaThreadSynchronize();
cutStopTimer(timer);
float naiveTime = cutGetTimerValue(timer);

printf("Naive transpose average time:     %0.3f ms\n", naiveTime / numIterations);

printf("Optimized transpose average time: %0.3f ms\n\n", optimizedTime / numIterations);

Then I ran the code and timed naiveTime and optimizedTime again.
What a big surprise!
The naiveTime is about 8.23 ms, and the optimizedTime is about 10.13 ms. :unsure:
I expected the two versions to give the same results.
Why don't they?
Please help! Thank you very much!

Hi, this is my first post and I'm really new to CUDA; I started two days ago, and there are a lot of things I still don't understand very well…

I believe the problem is a little bug in the program (in the original program, not in your modification). I think one should add the call

cudaThreadSynchronize();

after the two preliminary "warmup" kernel calls; otherwise one risks timing the warmups as well, due to the non-blocking nature of kernel launches. With the call to cudaThreadSynchronize() added, the behaviour you noticed disappears.
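For example, the warmup section would become the following (a minimal sketch based on the SDK code quoted above; cudaThreadSynchronize() is the same synchronization call the sample already uses in its timed sections):

// warmup so we don't time CUDA startup
transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
// block until both warmup kernels have finished, so their execution
// time cannot leak into the first timed section
cudaThreadSynchronize();

Because kernel launches return control to the CPU immediately, without this call the cudaThreadSynchronize() inside the first timed loop also waits for the still-running warmup kernels, so whichever kernel is timed first absorbs the warmup cost. That is why swapping the launch order swapped your results.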

giovanni