Negative timer values in 0.9

Hello,

I am running into a strange problem with the 0.9 release of CUDA that I haven't seen in 0.8. I am using a GeForce 8800 GTX with driver version 162.01 and Visual Studio 2003 under Windows XP.

I am timing a number of FFTs and IFFTs from the CUFFT library with the code below. Using a simple batch file, this is repeated for several data sizes (2x2, ..., 1024x1024) to compare CUFFT with the FFTW library.

In the 0.8 release of CUDA and CUFFT everything worked fine, but with the 0.9 release I occasionally get negative timing values! Could this be a bug in the new CUDA release, or am I doing something wrong?

// Start actual calculation
CUT_SAFE_CALL(cutStartTimer(timer));

// Allocate device memory for signal
float* d_signal;
float* d_signal_f;
CUDA_SAFE_CALL(cudaMalloc((void**)&d_signal, mem_size));
CUDA_SAFE_CALL(cudaMalloc((void**)&d_signal_f, mem_size));

// Copy host memory to device
CUDA_SAFE_CALL(cudaMemcpy(d_signal_f, h_signal, mem_size_f, cudaMemcpyHostToDevice));

if (number >= 1)
{
    for (unsigned int i = 0; i < number; ++i)
    {
        // Transform signal and kernel
        CUFFT_SAFE_CALL(cufftExecR2C(plan_r2c, (cufftReal *)d_signal_f, (cufftComplex *)d_signal));

        // Transform signal back
        CUFFT_SAFE_CALL(cufftExecC2R(plan_c2r, (cufftComplex *)d_signal, (cufftReal *)d_signal_f));
    }
}

// Copy device memory to host
CUDA_SAFE_CALL(cudaMemcpy(h_signal, d_signal_f, mem_size, cudaMemcpyDeviceToHost));

CUT_SAFE_CALL(cutStopTimer(timer));

printf("%d FFT of %d processing time : %f (ms)\n", number, usedw, cutGetTimerValue(timer));

Thanks for the help,

Arno

Edit: added type of video card, driver version, development environment and OS

What is probably causing your problem is that NVIDIA has snuck a left shift of the timer register value into the code generated by ptxas in 0.9 and above, so the clock now wraps around twice as fast as it used to. The wrap period was close to 6 seconds and is now 2.9 seconds, so negative values occur for any measurement above about 1.4 seconds.
Eric

OK, but I'm reading values in the range 0.3 to 500 ms, so I'm not even close to 1.4 seconds.

Some typical measurements:

number of FFT's   computation time (ms)
100               0.69
100               0.70
100               0.71
100               0.70
100               0.70
100               0.71
100               0.70
100               0.71
100               0.71
100               0.71
101               1.07
101               1.06
101               1.08
101               1.08
101               1.08
101               1.06
101               1.06
101               1.08
101               1.08
101               1.08
102               0.73
102               0.73
102               0.72
102               0.72
102               0.73
102               0.74
102               0.73
102               0.73
102               0.74
102               0.73
103               1.09
103               1.09
103               1.09
103               1.09
103               1.10
103               1.09
103               -32.04
103               1.09
103               1.10
103               1.09
104               0.72
104               -32.64
104               0.71
104               0.72
104               0.72
104               0.71
104               0.70
104               -32.62
104               0.71
104               0.71
105               0.72
105               0.73
105               34.13
105               0.73
105               0.72
105               0.72
105               0.72
105               0.73
105               0.72
105               0.72

Update: under Windows, CUDA uses QueryPerformanceCounter. This doesn't work flawlessly on multi-core systems (see http://support.microsoft.com//kb/896256 ). The update provided on that page helped for me: it reduced the problem significantly, but didn't eliminate it completely.