A strange timing result, call for help!

Recently I wrote a Gabor filter (a type of image filter) using CUDA, and I want to apply the filter to images in real time.
After I finished writing the filter, I measured the time needed to run it.
The code looks like this:
void gabor( … )
{
    /* parameter setting using C++ code */

    CUDA_SAFE_CALL( cudaThreadSynchronize() );
    CUT_SAFE_CALL( cutResetTimer(hTimer) );
    CUT_SAFE_CALL( cutStartTimer(hTimer) );

    … (many kernels here)

    CUDA_SAFE_CALL( cudaThreadSynchronize() );
    CUT_SAFE_CALL( cutStopTimer(hTimer) );
    gpuTime = cutGetTimerValue(hTimer);
}
It takes only about 3 milliseconds to run the filter. :rolleyes:
But when I time it in the main function using the C library function difftime, I get a very different result.
The code looks like this:
int main()
{
    /* parameter setting */
    time_t start, end;
    double dif;
    int numCirculation = 1000;

    time(&start);

    for (int i = 0; i < numCirculation; i++)
    {
        gabor( … );
    }

    time(&end);
    dif = difftime(end, start);      /* elapsed seconds for all calls */
    dif = dif / numCirculation;      /* average time per call */

    return 0;
}

This time it works out to about 40 milliseconds per call to the Gabor filter. :X

Why is there such a large difference between the two timings?

I'm a beginner at CUDA. I really need your help!

Thank you very much!

If you are using Windows, try using QueryPerformanceCounter and QueryPerformanceFrequency to time your code, or gettimeofday() on Linux, since the time() function is only accurate to whole seconds. With those timers you should be able to measure a single call, without needing the loop that repeatedly calls your filter.
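For example, something like this (just a sketch: wallTimeMs() is a made-up helper name, and gabor() stands for your own filter function and its arguments):

#include <cstdio>

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/time.h>
#endif

void gabor( /* ... */ );                 /* your filter function */

/* Current wall-clock time in milliseconds. */
static double wallTimeMs()
{
#ifdef _WIN32
    LARGE_INTEGER freq, now;
    QueryPerformanceFrequency(&freq);    /* counter ticks per second */
    QueryPerformanceCounter(&now);       /* current counter value */
    return 1000.0 * (double)now.QuadPart / (double)freq.QuadPart;
#else
    struct timeval tv;
    gettimeofday(&tv, 0);                /* seconds + microseconds since the epoch */
    return 1000.0 * tv.tv_sec + tv.tv_usec / 1000.0;
#endif
}

int main()
{
    double t0 = wallTimeMs();

    gabor( /* your arguments */ );       /* one call is enough at this resolution */

    double t1 = wallTimeMs();
    printf("gabor() took %.3f ms\n", t1 - t0);
    return 0;
}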

EDIT: Also, if there is anything happening before you start or after you stop the CUDA timer (for example the parameter setting at the top of gabor()), that work is not included in gpuTime, which would explain the difference between your two measurements.
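For instance, here is a sketch based on the gabor() you posted, with the cutil timer moved so it also covers the setup code; then gpuTime and the difftime measurement should be covering roughly the same work:

void gabor( … )
{
    CUDA_SAFE_CALL( cudaThreadSynchronize() );
    CUT_SAFE_CALL( cutResetTimer(hTimer) );
    CUT_SAFE_CALL( cutStartTimer(hTimer) );

    /* parameter setting using C++ code */   /* now inside the timed region */

    … (many kernels here)

    CUDA_SAFE_CALL( cudaThreadSynchronize() );
    CUT_SAFE_CALL( cutStopTimer(hTimer) );
    gpuTime = cutGetTimerValue(hTimer);      /* now includes the setup cost */
}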