How to calculate the speedup ratio between C code and CUDA program?

I have a C program and a CUDA program that both do the same thing.
How can I measure the speedup?
Besides the C clock() function, is there any other way to compare their performance?


Well, if you got a speedup of x10-x100 then I guess your wristwatch should suffice ;)

You can also use the timer functions in cutil, such as cutCreateTimer, cutResetTimer, cutStartTimer, and cutStopTimer.

I use the following:

#define TIMER_START( iTimer ) do {	\
	CUT_SAFE_CALL( cutCreateTimer( &iTimer ) );	\
	CUT_SAFE_CALL( cutResetTimer( iTimer ) );	\
	CUT_SAFE_CALL( cutStartTimer( iTimer ) );	\
	} while ( 0 )

#define TIMER_STOP( iDeviceId, iTimer, sTimerMsg ) do {	\
	float fKernelTimer = 0;	\
	char buffLogData[ 1000 ];	\
	CUT_SAFE_CALL( cutStopTimer( iTimer ) );	\
	fKernelTimer = cutGetTimerValue( iTimer );	\
	sprintf_s( buffLogData, "%s: [%0.3f] ms\n", sTimerMsg, fKernelTimer );	\
	LogGPUData( iDeviceId, buffLogData );	\
	CUT_SAFE_CALL( cutDeleteTimer( iTimer ) );	\
	} while ( 0 )

Or you can use cudaEvent.

Thanks for all the replies above.

By the way, I’ve seen many papers plot figures using GFLOPS as the coordinate axis when comparing GPU and CPU performance.
How can I know how many floating-point operations my program performs? Is there a command for that?

I think you have to either count the logical floating-point operations yourself by code inspection, or look at the PTX assembly itself.

Note that one can also use the event management functions already provided with CUDA for precise (sub-millisecond) benchmarking. Below is some sample code demonstrating this:

float       dt_milliseconds;
cudaEvent_t start, stop;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);

/* Execute CUDA commands/call kernel, or some serial code */

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&dt_milliseconds, start, stop);
printf("Elapsed time = %4.5f milliseconds.\n", dt_milliseconds);

cudaEventDestroy(start);
cudaEventDestroy(stop);