Timing using cudaEvent****() VS clock_gettime()

Hi all!
I'm trying to time my code. I have been told that cudaEvent****() can be used for this; meanwhile, my original code uses clock_gettime(). I printed the results measured by both cudaEvent****() and clock_gettime() below, and that is what I'm really confused about.

measured by cudaEvent****()
init data structure: 1971.517578ms
establish context: 0.007296ms
rearrange data: 234.271423ms
copy data: 53.402176ms
time stepping: 17221.333984ms

measured by clock_gettime()
init data structure: 1.802874s
establish context: 20.541891s
rearrange data: 0.235464s
copy data: 0.051851s
time stepping: 8.429955s

note:
init data structure: runs entirely on the CPU
establish context: a single line only: cudaFree((void*)0);
rearrange data: runs entirely on the CPU
copy data: transfers data from host to device
time stepping: two kernel functions are involved

Q1: The time for “establish context” measured by cudaEvent****() (0.0072 ms) is wildly different from that measured by clock_gettime() (~20.5 s). This part consists of a single line, which establishes the context:
cudaFree(0);
Where does this vast difference come from?

Q2: The time for “time stepping” measured by cudaEvent****() (~17.22 s) is about twice that measured by clock_gettime() (~8.43 s). Someone told me that asynchronous execution could be the reason, but I don't really get it. Can anyone help me understand?

Q3: The wall-clock time is really close to the time measured by clock_gettime(). However, I'm told that cudaEvent****() is preferable for timing CUDA code. I don't know which one I should choose.

What exactly is cudaEvent****()?

And how exactly did you use it?

In a kernel you can read each SM's hardware clock using clock() or clock64(),
BUT these clocks are not synchronised, i.e. each SM has its own independent clock.
Also, each GPU has its own quirks. At least on the C2050 (and perhaps others?),
a device reset appears to re-synchronise all the clocks.

There is some example code which calls clock(), passes its results back to the host,
and then the host estimates the total kernel elapsed time from the worst-case (longest)
SM elapsed time, taking care of cases where the clock overflows 32 bits.
Look for the files fixed.cu and fitness.c in
http://www.cs.ucl.ac.uk/staff/W.Langdon/ftp/gp-code/barracuda_gp.tar.gz
A rough sketch of the idea follows.
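
Something along these lines (my own minimal sketch, not the actual barracuda_gp code;
the kernel and variable names are made up, and using clock64() sidesteps the 32-bit
overflow handling):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void timed_kernel(long long *elapsed)
{
	long long start = clock64();

	// ... the real work of the kernel would go here ...

	long long stop = clock64();
	if (threadIdx.x == 0)
		elapsed[blockIdx.x] = stop - start;	// cycles on this block's SM
}

int main(void)
{
	const int blocks = 4;
	long long h_elapsed[blocks];
	long long *d_elapsed;
	cudaMalloc((void **)&d_elapsed, blocks * sizeof(long long));

	timed_kernel<<<blocks, 128>>>(d_elapsed);
	cudaMemcpy(h_elapsed, d_elapsed, blocks * sizeof(long long),
	           cudaMemcpyDeviceToHost);

	// take the worst-case (longest) per-block cycle count
	long long worst = 0;
	for (int i = 0; i < blocks; ++i)
		if (h_elapsed[i] > worst) worst = h_elapsed[i];

	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);
	// clockRate is in kHz, i.e. cycles per millisecond
	printf("worst-case SM time: %f ms\n", worst / (double)prop.clockRate);

	cudaFree(d_elapsed);
	return 0;
}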

Bill

The following is the part of my code where the timing functions and macros are defined.

#define TIMING 1

#if TIMING
double get_time()
{
	/* wall-clock time in seconds */
	struct timespec time;
	clock_gettime(CLOCK_REALTIME, &time);
	return (double)time.tv_sec + (double)time.tv_nsec * 1.0e-9;
}
#endif

#define CUDATIMING 0

#if CUDATIMING
#define cuda_timing_init \
	cudaEvent_t startEvent, stopEvent;\
	float timeEvent;\
	cudaEventCreate(&startEvent);\
	cudaEventCreate(&stopEvent);

#define cuda_timing_begin \
	cudaEventRecord(startEvent, 0);

#define cuda_timing_stop(str) \
	cudaEventRecord(stopEvent, 0);\
	cudaEventSynchronize(stopEvent);\
	cudaEventElapsedTime(&timeEvent, startEvent, stopEvent);\
	printf("time spent of %s: %fms\n", str, timeEvent);

#define cuda_timing_destroy \
	cudaEventDestroy(startEvent);\
	cudaEventDestroy(stopEvent);
#endif

I use these functions and macros for timing. For example, to time the “establish context” part, I use the code below:

#if TIMING
	double t1 = get_time();
#endif

#if CUDATIMING
	cuda_timing_init;
	cuda_timing_begin;
#endif

	// choose device / establish context
	// ...
	cudaFree((void*)0);

#if TIMING
	double t2 = get_time();
#endif

#if CUDATIMING
	cuda_timing_stop("choose device / establish context");
	cuda_timing_destroy;
#endif

Please tell me if there is anything wrong. I really appreciate it.

Thanks for your reply. I'll try to figure this out. Nevertheless, I still want answers to my questions. If you have any more ideas or suggestions, don't hesitate to tell me!

Q1: The time for “establish context” measured by cudaEvent****() (0.0072 ms) is wildly different from that measured by clock_gettime() (~20.5 s). This part consists of a single line, which establishes the context:
cudaFree(0);
Where does this vast difference come from?

cudaEventElapsedTime() is tied to events, and events are generally tied to streams.
If the work being measured never touches a stream (it runs purely on the host), the value the events report may make no logical sense.

Q2: The time for “time stepping” measured by cudaEvent****() (~17.22 s) is about twice that measured by clock_gettime() (~8.43 s). Someone told me that asynchronous execution could be the reason, but I don't really get it. Can anyone help me understand?

How many streams do you use in your application?
You record your events in the default stream; this may have an effect on the timing. If the time-stepping kernels run in a non-default stream, record the events in that same stream, as in the sketch below.
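
Something like this (a hypothetical sketch only; the stream and kernel names are made up, and startEvent/stopEvent/timeEvent are the ones from your macros):

cudaStream_t stepStream;
cudaStreamCreate(&stepStream);

cudaEventRecord(startEvent, stepStream);	// start event in the kernels' stream
stepKernelA<<<grid, block, 0, stepStream>>>(/* args */);	// grid, block: your launch configuration
stepKernelB<<<grid, block, 0, stepStream>>>(/* args */);
cudaEventRecord(stopEvent, stepStream);	// stop event in the same stream

cudaEventSynchronize(stopEvent);	// wait until the stop event has actually happened
cudaEventElapsedTime(&timeEvent, startEvent, stopEvent);	// result in milliseconds

cudaStreamDestroy(stepStream);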

Dear Lind,
Ok so you want to time things on the host.

I did not spot any errors, but one suggestion: you should check the error codes
returned by all your calls to CUDA functions, for example with a macro like the one below.
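
One common pattern (my own sketch, not code from your application) is to wrap every runtime call in a checking macro:

#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call) \
	do { \
		cudaError_t err_ = (call); \
		if (err_ != cudaSuccess) { \
			fprintf(stderr, "CUDA error \"%s\" at %s:%d\n", \
			        cudaGetErrorString(err_), __FILE__, __LINE__); \
			exit(EXIT_FAILURE); \
		} \
	} while (0)

/* usage, e.g.: CUDA_CHECK(cudaFree(0)); */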

Another point to remember is that although you pretty much have to use
some form of synchronise before you read the time on the host,
the synchronise itself may have an impact on your application. A sketch follows.
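
Roughly like this (my own illustration; the kernel names are made up): without the cudaDeviceSynchronize(), get_time() would be read while the kernels are still running, because kernel launches return control to the host immediately.

double t1 = get_time();

stepKernelA<<<grid, block>>>(/* args */);	// grid, block: your launch configuration
stepKernelB<<<grid, block>>>(/* args */);
cudaDeviceSynchronize();	// wait for the kernels to finish...

double t2 = get_time();		// ...before reading the host clock
printf("time stepping: %fs\n", t2 - t1);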

Bill