problem with clock

Hi!

I have the following problem:

I wrote this code:

__host__ void CuAddVector(float *_p1, float *_p2, unsigned int _uiSize, float *_pResult, unsigned int *_ulTime)
{
	unsigned int _uiStopTime, _uiStartTime;

	//allocate device memory (graphics card)
	float *_fCuda1, *_fCuda2, *_fCudaResult;
	_uiStartTime = clock();
	cudaMalloc((void**)&_fCuda1, _uiSize*sizeof(float));
	cudaMalloc((void**)&_fCuda2, _uiSize*sizeof(float));
	cudaMalloc((void**)&_fCudaResult, _uiSize*sizeof(float));

	//copy input data from host memory (RAM) to device memory (graphics card)
	cudaMemcpy(_fCuda1, _p1, _uiSize*sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(_fCuda2, _p2, _uiSize*sizeof(float), cudaMemcpyHostToDevice);

	//vector addition (only one thread/kernel)
	globalAddVector<<<dim3(1,1,1),dim3(1,1,1)>>>(_fCuda1, _fCuda2, _uiSize, _fCudaResult);

	//copy output data from device memory (graphics card) to host memory (RAM)
	cudaMemcpy(_pResult, _fCudaResult, _uiSize*sizeof(float), cudaMemcpyDeviceToHost);
	_uiStopTime = clock();

	if (_uiStopTime >= _uiStartTime)
		*_ulTime = _uiStopTime - _uiStartTime;
	else	//counter wrapped around between start and stop
		*_ulTime = _uiStopTime + (0xffffffff - _uiStartTime);
}

P1 and P2 are the arrays I want to add, uiSize is the size of both. The function globalAddVector adds them (using one thread). With the difference of uiStartTime and uiStopTime I want to measure the GPU cycles. But independent of the array size (I tested values between 10 and 10000), ulTime is always about 60-80! Where is the error?

And a second short question: how does the function

cudaMemcpy(_pResult, _fCudaResult, _uiSize*sizeof(float), cudaMemcpyDeviceToHost);

know that the kernel ‘globalAddVector’ has finished?

Please help me!

For array size = 10 the measured time is sometimes 0. That is impossible, isn’t it?

Are you sure the kernel is actually launching?

Add this line after the kernel invocation

printf("%s\n",cudaGetErrorString(cudaGetLastError()));

You might need to #include <cuda_runtime.h> for cudaGetErrorString and cudaGetLastError

cudaMemcpys are synchronous by default and simply wait until all kernels called before them finish.
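So in your code the device-to-host cudaMemcpy already acts as the synchronization point before the second clock() call. If you ever time a kernel without a trailing copy, you can make the wait explicit; a minimal sketch based on your code (cudaDeviceSynchronize is the current name, older toolkits use cudaThreadSynchronize):

	globalAddVector<<<dim3(1,1,1),dim3(1,1,1)>>>(_fCuda1, _fCuda2, _uiSize, _fCudaResult);
	//report whether the launch itself succeeded
	printf("%s\n", cudaGetErrorString(cudaGetLastError()));
	//block the host until the kernel has finished before taking the stop time
	cudaDeviceSynchronize();
	_uiStopTime = clock();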

By the way, you forgot to cudaFree what you’ve cudaMalloc’d

EDIT: remember that clock() has a fairly low resolution. It’s possible for it to report 0 if the time taken was smaller than its resolution.

Thank you very much for your answer!

I added

printf("%s\n",cudaGetErrorString(cudaGetLastError()));

and it returns ‘no error’. In addition, _pResult shows the right result. In debugging mode I can see that clock() sometimes returns the same value for StartTime and StopTime; otherwise the difference is about 60-80, independent of the array size (size = 100, size = 1000, size = 10000).

I added cudaFree(), thanks!

Now I work with the API functions cutCreateTimer, cutStartTimer, cutStopTimer, cutGetTimerValue and cutDeleteTimer (shown in the SDK clock example) and it works fine.
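For anyone finding this later: the usual pattern with those cutil timers (declared in cutil.h from the SDK; this is just a sketch, exact signatures may differ between SDK versions) looks roughly like this:

	unsigned int uiTimer = 0;
	cutCreateTimer(&uiTimer);
	cutStartTimer(uiTimer);

	//... kernel launch and memcpys to be timed ...
	cudaThreadSynchronize();	//make sure the GPU work has really finished

	cutStopTimer(uiTimer);
	printf("elapsed: %f ms\n", cutGetTimerValue(uiTimer));	//wall-clock time in milliseconds
	cutDeleteTimer(uiTimer);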

Hi schmeing,

I think that I have the same problem. I want to measure the GPU cycles for a matrix multiplication, but the result I obtain with your code is always time = 0.
If you have solved your problem, please tell me how I can solve it too.
Thanks for the help!

Do you use clock() or something more sophisticated?

Yes, I use the clock() function:

StartTime = clock();
//... code to be timed (the matrix multiplication) ...
StopTime = clock();
printf("%u\n", StopTime - StartTime);

The clock() function doesn’t have the required temporal resolution. If the time taken is less than, for example, a couple of ms, it will register 0.

The solution is to use better timers.
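If you want timings that do not depend on the host clock() resolution at all, the CUDA event API measures on the GPU itself with sub-millisecond resolution; a minimal sketch (variable names here are just illustrative):

	cudaEvent_t start, stop;
	float fElapsedMs = 0.0f;

	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, 0);
	//... kernel launch and memcpys to be timed ...
	cudaEventRecord(stop, 0);

	cudaEventSynchronize(stop);	//wait until the stop event has actually been reached
	cudaEventElapsedTime(&fElapsedMs, start, stop);	//elapsed time in milliseconds
	printf("elapsed: %f ms\n", fElapsedMs);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);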

The CUDA clock() function has a rather impressive sub-microsecond resolution… it’s actual clock ticks of the GPU. It has a slight deficiency in that the ticks are not measures of time, but of cycles. But this means that the timings are mostly independent of GPU frequency. The other common problem and frustration is that the super-fine resolution of the function means the value rolls over in just two or three seconds.

I use the clock() function here to measure the throughput of CUDA operations.

The CUDA clock() function has been invaluable to me for manual profile instrumentation of my more complex kernels… it’s easy for a running kernel to keep checkpoints of its time allocations for different subtasks.
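As an illustration (a sketch, not anyone’s actual kernel): the device-side clock() returns the cycle counter of the multiprocessor the thread runs on, so you can snapshot it around a subtask and write the difference out for inspection:

	__global__ void instrumentedKernel(float *_pData, unsigned int _uiSize, unsigned int *_pCycles)
	{
		unsigned int uiBefore = (unsigned int)clock();	//per-multiprocessor cycle counter

		//... subtask to be profiled (here: a single thread doubling the data) ...
		for (unsigned int i = 0; i < _uiSize; ++i)
			_pData[i] *= 2.0f;

		unsigned int uiAfter = (unsigned int)clock();
		if (threadIdx.x == 0)
			*_pCycles = uiAfter - uiBefore;	//unsigned subtraction also copes with counter wrap-around
	}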

There is also a host function called clock(), defined in time.h. It has terrible resolution, about 1 ms, and I think it’s even 1/60 of a second on the Mac. This is likely what you’re thinking of when you say clock() has poor resolution.
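For completeness, this is how the host clock() from time.h is normally used; note that it counts ticks of CLOCKS_PER_SEC per second, so the raw difference is not GPU cycles (a sketch):

	#include <stdio.h>
	#include <time.h>

	int main(void)
	{
		clock_t tStart = clock();
		//... work to be timed, e.g. the CuAddVector call ...
		clock_t tStop = clock();

		//convert ticks to milliseconds using CLOCKS_PER_SEC
		printf("%.3f ms\n", 1000.0 * (double)(tStop - tStart) / CLOCKS_PER_SEC);
		return 0;
	}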