timing and the profiler getting different results from each other

Hi, I’m calling 2 different kernel functions from my code and timing them using the following…

...

	cudaThreadSynchronize();

	cudaEventRecord(start, 0);

	SOM_propogate_input_OnDevice <<< 10, SOM_size/10, size >>> (input_d, weights_d, activity_d);
	winner_OnDevice <<< 1, SOM_size, size >>> (activity_d, activity_index_d);

	cudaThreadSynchronize();

	// end timer
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float elapsedTime;
	cudaEventElapsedTime(&elapsedTime, start, stop);

...

By placing one kernel and then the other inside this timing code I get the execution time of each in ms, and this shows me that the first kernel (SOM_propogate…) takes longer to execute than the second (winner_OnDev…). However, when I run the profiler it tells me that the second kernel takes longer.

So why do my timing code and the profiler disagree on this?

Maybe I am missing something, but isn’t your event timer measuring the execution duration of both kernels?

Yes, in that example it is. To time the second kernel I move the first kernel call above the timer start, and to time the first kernel I move the second call below the timer code. I posted the code above so you can see how I’m doing the timing…

I hope you mean you put it above the cudaThreadSynchronize() call, not above the timer start, otherwise it still won’t be correct.
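Kernel launches are asynchronous, so the excluded kernel has to have actually finished before the start event is recorded. Something like this, reusing the kernel names and launch parameters from your snippet (just a sketch of the ordering, untested):

	// To time only winner_OnDevice, launch the first kernel and let it
	// finish *before* the start event is recorded.
	SOM_propogate_input_OnDevice <<< 10, SOM_size/10, size >>> (input_d, weights_d, activity_d);
	cudaThreadSynchronize();   // first kernel has completely finished here

	cudaEventRecord(start, 0);
	winner_OnDevice <<< 1, SOM_size, size >>> (activity_d, activity_index_d);
	cudaThreadSynchronize();
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float elapsedTime;
	cudaEventElapsedTime(&elapsedTime, start, stop);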

indeed

OK, so the next obvious question is how different are the times?

The usual caveat with profiling data is that it is only collected by instrumenting one MP with profile counters and then scaling up the results. So if one or the other of your kernels has a vastly asymmetrical work distribution amongst blocks, you can get some curious differences (although they should average out over a number of runs). I don’t think that extends to total kernel timing, however.

OK, so here is the profiler output in a graph (attached).

When I time the functions I get around 0.12 ms for the SOM_propogate kernel and around 0.03 ms for the winner_ kernel, which seems like quite a difference to me.
[attached: Screen_shot_2010_02_03_at_12.54.28.png]

When you are profiling and timing these kernels, how many launches are you measuring?

Good question; I’m not sure I know the answer. My program launches each kernel only once for now. The profiler works with the compiled code and runs it 5 times (I think) to average the results. Each run I get the program’s output, which tells me the time it takes to run one of the kernels; if I recompile to time the other kernel, I get its time instead. In every case the reported times are consistent with my previous post, and the profiler output is also consistent with my previous post, but the two don’t agree with each other.

I should mention that I’m running CUDA 3.0, but the profiler is from CUDA 2.3. I suppose this could be the cause?

If it were my code, I wouldn’t be concluding anything on the basis of a single measurement, especially when the kernel durations are so short. I would be running each kernel a few thousand times and then comparing the mean and standard deviation of the run times. But that is just me…
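For example, something along these lines (a rough, untested sketch that reuses the start/stop events and the launch parameters from your earlier snippet; the launch count and the host-side statistics are just illustrative, and it assumes the usual includes for printf and sqrtf):

	const int N = 1000;                 // number of timed launches (illustrative)
	float times[N];                     // per-launch times in ms

	for (int i = 0; i < N; ++i) {
		cudaEventRecord(start, 0);
		SOM_propogate_input_OnDevice <<< 10, SOM_size/10, size >>> (input_d, weights_d, activity_d);
		cudaEventRecord(stop, 0);
		cudaEventSynchronize(stop);     // wait for this launch before reading the time
		cudaEventElapsedTime(&times[i], start, stop);
	}

	// mean and standard deviation of the per-launch times
	float sum = 0.0f, sumSq = 0.0f;
	for (int i = 0; i < N; ++i) {
		sum   += times[i];
		sumSq += times[i] * times[i];
	}
	float mean   = sum / N;
	float stddev = sqrtf(sumSq / N - mean * mean);
	printf("%d launches: mean %f ms, stddev %f ms\n", N, mean, stddev);

You may also want to discard the first launch as a warm-up, and then repeat the same loop for winner_OnDevice so the two kernels get comparable statistics.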

Thanks aviday, I just tried exactly that on your advice and I get something in line with the profiler!

So lesson learned: don’t rely on timing a single call to a kernel.

Cheers