By placing one and then the other inside this timing function I get the execution time of each in ms, and this shows that the first kernel (SOM_propogate…) takes longer to execute than the second (winner_OnDev…). However, when I run the profiler it tells me that the second function takes longer.
So why do my timing and the profiler disagree on this?
Yes, in that example it is. To time the second kernel I move the first kernel call above the timer start, and to time the first I move the second call below the timer code. I put this code here so you can see how I’m doing the timing…
OK, so the next obvious question is how different are the times?
The usual caveat with profiling data is that it is only collected by instrumenting one MP with profile counters and then scaling up the results. So if one or the other of your kernels has a vastly asymmetrical work distribution amongst blocks, you can get some curious differences (although they should average out over a number of runs). I don’t think that extends to total kernel timing, however.
OK, so here is the profiler output in a graph (attached).
When I time the functions I get around 0.12 ms for the SOM_propogate kernel and around 0.03 ms for the winner_ kernel, which seems like quite a difference to me.
Good question; I’m not sure I know the answer. My program launches each kernel only once for now, while the profiler works with the compiled code and runs it 5 times (I think) to average the results. Each run of my program reports the time taken by one of these functions, and if I recompile to time the other function I get its time instead. In each case the reported times are consistent with my previous post, and each time the profiler output is also consistent with my previous post, but not with what the timer function returns.
I should mention that I’m running CUDA 3.0, but the profiler is from CUDA 2.3. I suppose this could be the cause?
If it were my code, I wouldn’t be concluding anything on the basis of a single measurement, especially when the kernel durations are so short. I would be running each kernel a few thousand times and then comparing the mean and standard deviation of the run times. But that is just me…