A strange timing problem, please help!

Recently I wrote a Gabor filter (a type of image filter) in CUDA, and I want to apply it to images in real time.
After finishing the filter, I measured how long it takes to run.
The code looks like this:
void gabor( … )
{
    /* parameter setup using C++ */

    CUDA_SAFE_CALL( cudaThreadSynchronize() );   // make sure the GPU is idle before timing
    CUT_SAFE_CALL( cutResetTimer(hTimer) );
    CUT_SAFE_CALL( cutStartTimer(hTimer) );

    … // many kernels here

    CUDA_SAFE_CALL( cudaThreadSynchronize() );   // wait until all kernels have finished
    CUT_SAFE_CALL( cutStopTimer(hTimer) );
    gpuTime = cutGetTimerValue(hTimer);          // elapsed time in milliseconds
}

It takes only about 3 milliseconds to run the filter. :rolleyes:
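For comparison, I understand the same GPU-side interval could also be measured with CUDA events instead of the cutil timer; a minimal sketch (with the kernels elided as above) might look like this:

float gpuTimeMs = 0.0f;
cudaEvent_t evStart, evStop;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);

cudaEventRecord(evStart, 0);          // record on the default stream

… // the same kernels as above

cudaEventRecord(evStop, 0);
cudaEventSynchronize(evStop);         // wait for the stop event to complete
cudaEventElapsedTime(&gpuTimeMs, evStart, evStop);   // elapsed time in milliseconds

cudaEventDestroy(evStart);
cudaEventDestroy(evStop);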
But when I time it from the main function using the C function difftime, I get a very different number.
The code looks like this:
int main()
{
    /* parameter setup */
    time_t start, end;
    double dif;
    int numCirculation = 1000;

    time(&start);
    for (int i = 0; i < numCirculation; i++)
        gabor( … );                    // run the filter 1000 times
    time(&end);

    dif = difftime(end, start);        // total elapsed time in seconds
    dif = dif / numCirculation;        // average time per call
}


Measured this way, it takes about 40 milliseconds per call of the Gabor filter. :X
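As far as I know, time()/difftime only has one-second resolution, which is why I average over 1000 calls. If it helps, a host-side timer with millisecond resolution might look like the sketch below (this assumes a POSIX system with gettimeofday, which is not in my actual code; on Windows QueryPerformanceCounter would be the equivalent):

#include <stdio.h>
#include <sys/time.h>

// current wall-clock time in milliseconds (POSIX only)
static double wallTimeMs(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

…

double t0 = wallTimeMs();
gabor( … );                            // a single call, timed on the host
double t1 = wallTimeMs();
printf("one call took %.3f ms\n", t1 - t0);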

Why is there such a large difference between the two measurements?

I’m a beginner with CUDA. I really need your help!

Thank you very much!