Hi all!
I'm trying to time my code. I've been told that the cudaEvent****() functions can be used for this; meanwhile, my original code uses clock_gettime(). I print the results measured by both cudaEvent****() and clock_gettime() below, and that is what I'm really confused about.
measured by cudaEvent****()
init data structure: 1971.517578ms
establish context: 0.007296ms
rearrange data: 234.271423ms
copy data: 53.402176ms
time stepping: 17221.333984ms
measured by clock_gettime()
init data structure: 1.802874s
establish context: 20.541891s
rearrange data: 0.235464s
copy data: 0.051851s
time stepping: 8.429955s
notes:
init data structure: CPU work only
establish context: a single line: cudaFree((void*)0);
rearrange data: CPU work only
copy data: transfers data from host to device
time stepping: involves two kernel functions
Q1: The time for “establish context” measured by cudaEvent****() (0.0072 ms) is wildly different from that measured by clock_gettime() (~20.5 s). This part consists of the single line that establishes the context:
cudaFree(0);
How does this vast difference arise?
Q2: The time for “time stepping” measured by cudaEvent****() (~17.22 s) is about twice that measured by clock_gettime() (~8.43 s). Someone told me that asynchrony could be a possible reason, but I don't really get it. Can anyone help me understand this?
Q3: The wall-clock time is really close to the time measured by clock_gettime(). However, I'm told that cudaEvent****() is preferable for timing CUDA code. I don't know which one I should choose.
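For reference, this is roughly the pattern I use for both timers (not my actual code; the kernel, sizes, and names below are just placeholders):

#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *x, int n)   // placeholder for my real kernels
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    struct timespec t0, t1;

    // clock_gettime() brackets the same region as the events
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cudaEventRecord(start, 0);

    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);            // wait until the GPU work is done
    clock_gettime(CLOCK_MONOTONIC, &t1);

    float ms_event;
    cudaEventElapsedTime(&ms_event, start, stop);
    double s_wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

    printf("cudaEvent:     %f ms\n", ms_event);
    printf("clock_gettime: %f s\n", s_wall);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}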
Inside a kernel you can access each SM's hardware clock using clock() or clock64().
BUT these clocks are not synchronised, i.e. each SM has its own independent clock.
Also, each GPU has its own details; on the C2050 at least (and perhaps others?)
a device reset appears to re-synchronise all the clocks.
There is some example code which calls clock(), passes its results back to the host,
and the host then estimates the total kernel elapsed time from the worst-case (longest)
SM elapsed time, taking care of where the clock overflows 32 bits.
Look for the files fixed.cu and fitness.c in http://www.cs.ucl.ac.uk/staff/W.Langdon/ftp/gp-code/barracuda_gp.tar.gz
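A rough sketch of the idea (this is not the code from the tarball, just an illustration with an empty placeholder kernel; clock64() also avoids the 32-bit overflow handling that clock() needs):

#include <stdio.h>
#include <cuda_runtime.h>

// each block records how many SM clock ticks its work took
__global__ void timed_kernel(long long *ticks_per_block)
{
    long long start = clock64();

    // ... the real work would go here ...

    long long stop = clock64();
    if (threadIdx.x == 0)
        ticks_per_block[blockIdx.x] = stop - start;
}

int main(void)
{
    const int blocks = 64, threads = 128;
    long long *d_ticks, h_ticks[blocks];
    cudaMalloc(&d_ticks, blocks * sizeof(long long));

    timed_kernel<<<blocks, threads>>>(d_ticks);
    cudaMemcpy(h_ticks, d_ticks, blocks * sizeof(long long), cudaMemcpyDeviceToHost);

    // the worst-case (longest) block gives an estimate of the kernel elapsed time
    long long worst = 0;
    for (int i = 0; i < blocks; ++i)
        if (h_ticks[i] > worst) worst = h_ticks[i];

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // clockRate is in kHz, i.e. ticks per millisecond
    printf("worst-case block time: %f ms\n", worst / (double)prop.clockRate);

    cudaFree(d_ticks);
    return 0;
}

Since each SM's clock is independent, only the per-block durations are meaningful, not absolute start/stop values compared across blocks.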
Thanks for your reply, I'll try to figure this out. Nevertheless, I still want answers to my original questions; if you have any more ideas or suggestions, don't hesitate to tell me!
Q1: The time for “establish context” measured by cudaEvent****() (0.0072 ms) is wildly different from that measured by clock_gettime() (~20.5 s). This part consists of the single line that establishes the context:
cudaFree(0);
How does this vast difference arise?
cudaEventElapsedTime() ties to events; events generally tie to streams.
If the event being measured is not associated with any stream's work, the measured value may not make logical sense.
Q2: The time for “time stepping” measured by cudaEvent****() (~17.22 s) is about twice that measured by clock_gettime() (~8.43 s). Someone told me that asynchrony could be a possible reason, but I don't really get it. Can anyone help me understand this?
How many streams do you use in your application?
You are recording your events in the default stream; this may have an effect on the timing.
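For illustration only (placeholder kernel and names), recording the events in an explicit stream looks like this:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void work(float *x, int n)   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // record the events in the same stream as the kernel, so the
    // elapsed time covers exactly that stream's work
    cudaEventRecord(start, s);
    work<<<(n + 255) / 256, 256, 0, s>>>(d_x, n);
    cudaEventRecord(stop, s);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel in stream: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(s);
    cudaFree(d_x);
    return 0;
}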
Dear Lind,
OK, so you want to time things on the host.
I did not spot any errors, but one suggestion is that you should check for errors returned by all your calls to CUDA functions.
Another point to remember: although you pretty much have to use some form of synchronize before you read the time on the host, the synchronize itself may have an impact on your application.
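As a sketch of both points (the CUDA_CHECK macro and the placeholder kernel are mine, not taken from your code):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda_runtime.h>

// wrap every CUDA runtime call so failures are reported immediately
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err));                         \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void step(float *x, int n)   // placeholder for the time-stepping kernels
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_x;
    CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    step<<<(n + 255) / 256, 256>>>(d_x, n);
    CUDA_CHECK(cudaGetLastError());        // catch launch errors
    CUDA_CHECK(cudaDeviceSynchronize());   // the launch is asynchronous: wait before reading the host clock

    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("time stepping: %f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9);

    CUDA_CHECK(cudaFree(d_x));
    return 0;
}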