A frustrating question! Call for help with the CUDA timer

The following is part of the transpose code in the CUDA SDK example:

// warmup so we don't time CUDA startup
transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);

int numIterations = 1;

printf("Transposing a %d by %d matrix of floats...\n", size_x, size_y);

// execute the kernel
cutStartTimer(timer);
for (int i = 0; i < numIterations; ++i)
{
    transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);    // first: naive transpose
}
cudaThreadSynchronize();
cutStopTimer(timer);
float naiveTime = cutGetTimerValue(timer);

// execute the kernel

cutResetTimer(timer);
cutStartTimer(timer);
for (int i = 0; i < numIterations; ++i)
{
    transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);          // second: optimized transpose
}
cudaThreadSynchronize();
cutStopTimer(timer);
float optimizedTime = cutGetTimerValue(timer);

printf("Naive transpose average time:     %0.3f ms\n", naiveTime / numIterations);

printf("Optimized transpose average time: %0.3f ms\n\n", optimizedTime / numIterations);

I ran the code and timed the naive transpose and the optimized transpose.
The naiveTime was about 17.6 ms, and the optimizedTime was 0.581 ms.

Then I changed the code:

// warmup so we don't time CUDA startup
transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y); // I changed the order

int numIterations = 1;

printf("Transposing a %d by %d matrix of floats...\n", size_x, size_y);

// execute the kernel
cutStartTimer(timer);
for (int i = 0; i < numIterations; ++i)
{
    transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);          // first: optimized transpose
}
cudaThreadSynchronize();
cutStopTimer(timer);
float optimizedTime = cutGetTimerValue(timer);

// execute the kernel

cutResetTimer(timer);
cutStartTimer(timer);
for (int i = 0; i < numIterations; ++i)
{
    transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);    // second: naive transpose
}
cudaThreadSynchronize();
cutStopTimer(timer);
float naiveTime = cutGetTimerValue(timer);

printf("Naive transpose average time:     %0.3f ms\n", naiveTime / numIterations);

printf("Optimized transpose average time: %0.3f ms\n\n", optimizedTime / numIterations);

Then I ran the code and timed naiveTime and optimizedTime again.
What a big surprise!
The naiveTime was about 8.23 ms, and the optimizedTime was about 10.13 ms.
I expected the results to be the same.
Why are they not?
Call for help! Thank you very much!

Not 100% sure, but I think that to 'warm up' the CUDA device you can only use one kernel at a time, i.e. warm up one kernel, then time it, then warm up the second kernel and time that one.
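If that's the issue, the timing section would be restructured so each kernel gets its own warmup immediately before it is timed. A rough sketch, reusing the timer and variable names from the SDK sample above (not tested, just the idea):

```cuda
// Warm up the naive kernel right before timing it, so its first-launch
// overhead is not charged to the timed loop.
transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
cudaThreadSynchronize();

cutResetTimer(timer);
cutStartTimer(timer);
for (int i = 0; i < numIterations; ++i)
{
    transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
}
cudaThreadSynchronize();   // wait for the kernels to finish before stopping
cutStopTimer(timer);
float naiveTime = cutGetTimerValue(timer);

// ...then repeat the same warmup-then-time pattern for the optimized
// transpose kernel before reading optimizedTime.
```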

You need cudaThreadSynchronize() calls before every call to cutStopTimer() and cutStartTimer(); otherwise you have no clue what you are actually timing.
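Concretely, for the first timed loop in the code above, that would look something like this (a sketch using the same names as the sample):

```cuda
cudaThreadSynchronize();   // drain any work still queued on the device
cutStartTimer(timer);
for (int i = 0; i < numIterations; ++i)
{
    transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
}
// Kernel launches are asynchronous: without this sync, cutStopTimer()
// would fire while the kernels may still be running on the GPU.
cudaThreadSynchronize();
cutStopTimer(timer);
float naiveTime = cutGetTimerValue(timer);
```

Without the first sync, the timer can also pick up leftover work from a previous launch, which would explain why whichever kernel is timed second looks slower.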

Thank you for your advice!

I will change the code and try it again!

It’s much better to use CUDA events for timing kernels.
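For reference, a minimal sketch of event-based timing with the standard CUDA event API, reusing the kernel and variable names from the sample above. Events are recorded on the GPU itself, so they measure kernel time without depending on CPU-side synchronization points:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);           // record in the same (default) stream
for (int i = 0; i < numIterations; ++i)
{
    transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);          // wait until the stop event has completed

float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, start, stop);   // elapsed time in ms
printf("Average time: %0.3f ms\n", elapsedMs / numIterations);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```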

Paulius