Hi all!
I’m trying to time my code. I’ve been told that cudaEvent****() can be used for this; meanwhile, my original code uses clock_gettime(). I printed the results measured by both cudaEvent****() and clock_gettime(), shown below, and the numbers are what really confuse me.
Measured by cudaEvent****():
init data structure: 1971.517578ms
establish context: 0.007296ms
rearrange data: 234.271423ms
copy data: 53.402176ms
time stepping: 17221.333984ms
Measured by clock_gettime():
init data structure: 1.802874s
establish context: 20.541891s
rearrange data: 0.235464s
copy data: 0.051851s
time stepping: 8.429955s
What each part does:
init data structure: runs entirely on the CPU
establish context: one line only: cudaFree((void*)0);
rearrange data: runs entirely on the CPU
copy data: transfers data from host to device
time stepping: involves two kernel functions
Q1: The time for “establish context” measured by cudaEvent****() (0.0072 ms) is wildly different from that measured by clock_gettime() (~20.5 s), even though this part contains only the one line that establishes the context. How does this vast difference happen?
Q2: The time for “time stepping” measured by cudaEvent****() (~17.22 s) is about twice that measured by clock_gettime() (~8.43 s). Someone told me that asynchronous execution could be a possible reason, but I don’t really get it. Can anyone help me understand?
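To make Q2 concrete, here is roughly how I take the two measurements around the time-stepping loop (a simplified sketch, not my exact code; kernel1, kernel2, the step count, and the launch configuration are all placeholders):

```cpp
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// Placeholder kernels standing in for my real ones.
__global__ void kernel1() {}
__global__ void kernel2() {}

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    struct timespec t0, t1;

    cudaEventRecord(start, 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int step = 0; step < 10000; ++step) {
        kernel1<<<64, 256>>>();   // a launch returns immediately,
        kernel2<<<64, 256>>>();   // before the GPU has finished
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);  // CPU timer stops here
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);           // event timing waits for the GPU

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double s = (t1.tv_sec - t0.tv_sec)
             + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("cudaEvent: %fms  clock_gettime: %fs\n", ms, s);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```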
Q3: The wall-clock time is really close to the time measured by clock_gettime(). However, I’m told that cudaEvent****() is preferable for timing CUDA code. I don’t know which one I should choose.