Hi all,
I’m trying to make various time measurements of some kernels of mine and I have a few questions concerning that.
For one of the time measurements, I’m interested in finding out how long it takes for all threads of the grid to execute, so I’m doing the following.
[codebox]
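__global__ void my_kernel(var1, var2, var3, clock_t *startTimesArray, clock_t *endTimesArray){
    // global thread index (assumes a 1D grid of 1D blocks)
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    clock_t start, finish;
    start = clock();   // per-thread cycle counter before the work
    //kernel computations go here:
    finish = clock();  // per-thread cycle counter after the work
    startTimesArray[threadID] = start;
    endTimesArray[threadID] = finish;
}[/codebox]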
startTimesArray and endTimesArray are pointers to arrays in global memory, which I later retrieve onto the host. On the host, I:
- search startTimesArray for the earliest start time, i.e. the smallest value
- search endTimesArray for the latest end time, i.e. the largest value
- subtract the smallest value from the largest value and divide by the clock rate to get the elapsed time
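The host-side steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `elapsedMs` and the parameter `clockRateKHz` are hypothetical, the clock rate is assumed to be given in kHz (the unit used by `cudaDeviceProp::clockRate`), and the counter values are assumed to be comparable across threads and not to have wrapped around, which is part of what is being asked below.

```cpp
#include <algorithm>
#include <ctime>
#include <vector>

// Hypothetical helper: elapsed time in milliseconds between the earliest
// recorded start and the latest recorded finish.
// clockRateKHz is assumed to be the device clock rate in kHz.
double elapsedMs(const std::vector<clock_t>& startTimes,
                 const std::vector<clock_t>& endTimes,
                 double clockRateKHz)
{
    // Nothing sensible to report for empty input or an invalid clock rate.
    if (startTimes.empty() || endTimes.empty() || clockRateKHz <= 0.0) {
        return 0.0;
    }

    // Earliest start time = smallest value in the start array.
    const clock_t earliest =
        *std::min_element(startTimes.begin(), startTimes.end());

    // Latest end time = largest value in the end array.
    const clock_t latest =
        *std::max_element(endTimes.begin(), endTimes.end());

    // 1 kHz is one cycle per millisecond, so cycles / kHz gives milliseconds.
    return static_cast<double>(latest - earliest) / clockRateKHz;
}
```

The two vectors would be filled by copying startTimesArray and endTimesArray back from the device, for example with cudaMemcpy, before calling the helper.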
What I’d like to know is:
- Is this a sound method to measure the time elapsed within the kernel without having to worry about overhead for launching the kernel and so on?
- I’m not sure about the architecture, but would queries to the clock such as “start = clock();” and “finish = clock();” yield the same (or very close) results even when they come from different threads executing on different multiprocessors? What I mean is, if:
  - thread X is running on MP 1
  - thread Y is running on MP 3
  - thread X and thread Y execute the statement “start = clock();” at the same time
  would both threads’ “start” variables contain more or less the same value? Or does each multiprocessor have its own clock, which may yield different values?
Thanks for any and all answers to my questions.