timing and the profiler getting different results from each other

Hi, I’m calling 2 different kernel functions from my code and timing them using the following…

...

	cudaThreadSynchronize();

	cudaEventRecord(start, 0);

	SOM_propogate_input_OnDevice <<< 10, SOM_size/10, size >>> (input_d, weights_d, activity_d);
	winner_OnDevice <<< 1, SOM_size, size >>> (activity_d, activity_index_d);

	cudaThreadSynchronize();

	// end timer
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float elapsedTime;
	cudaEventElapsedTime(&elapsedTime, start, stop);

...

By placing one kernel and then the other inside this timing code I get the execution time of each in ms, and this shows me that the first kernel (SOM_propogate…) takes longer to execute than the second (winner_OnDev…). However, when I run the profiler it tells me that the second kernel takes longer.

So why do my timing code and the profiler disagree on this?

Maybe I am missing something, but isn’t your event timer measuring the execution duration of both kernels?

Yes, in that example it is. To time the second kernel I move the first kernel call above the timer start, and to time the first kernel I move the second call below the timer code. I posted the code above so you can see how I’m doing the timing…

I hope you mean you put it above the cudaThreadSynchronize() call, not above the timer start, otherwise it still won’t be correct.
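Kernel launches are asynchronous, so the excluded kernel has to have actually finished before the start event is recorded. Something like this, reusing the kernel names and launch parameters from your snippet (just a sketch of the ordering, untested):

	// To time only winner_OnDevice, launch the first kernel and let it
	// finish *before* the start event is recorded.
	SOM_propogate_input_OnDevice <<< 10, SOM_size/10, size >>> (input_d, weights_d, activity_d);
	cudaThreadSynchronize();   // first kernel has completely finished here

	cudaEventRecord(start, 0);
	winner_OnDevice <<< 1, SOM_size, size >>> (activity_d, activity_index_d);
	cudaThreadSynchronize();
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float elapsedTime;
	cudaEventElapsedTime(&elapsedTime, start, stop);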

indeed

OK, so the next obvious question is how different are the times?

The usual caveat with profiling data is that it is only collected by instrumenting one MP with profile counters and then scaling up the results. So if one or the other of your kernels has a vastly asymmetrical work distribution amongst blocks, you can get some curious differences (although they should average out over a number of runs). I don’t think that extends to total kernel timing, however.

OK, so here is the profiler output in a graph (attached).

When I time the functions I get around 0.12 ms for the SOM_propogate kernel and around 0.03 ms for the winner_ kernel, which seems like quite a difference to me.
[attached: Screen_shot_2010_02_03_at_12.54.28.png]

When you are profiling and timing these kernels, how many launches are you measuring?

Good question; I’m not sure I know the answer. My program launches each kernel only once for now. The profiler works with the compiled code and runs it 5 times (I think) to average the results. Each run I get the program’s output, which tells me the time it takes to run one of the kernels; if I recompile to time the other kernel, I get its time instead. In every case the reported times are consistent with my previous post, and the profiler output is also consistent with my previous post, but the two don’t agree with each other.

I should mention that I’m running CUDA 3.0, but the profiler is from CUDA 2.3. I suppose this could be the cause?

If it were my code, I wouldn’t be concluding anything on the basis of a single measurement, especially when the kernel durations are so short. I would be running each kernel a few thousand times and then comparing the mean and standard deviation of the run times. But that is just me…
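For example, something along these lines (a rough, untested sketch that reuses the start/stop events and the launch parameters from your earlier snippet; the launch count and the host-side statistics are just illustrative, and it assumes the usual includes for printf and sqrtf):

	const int N = 1000;                 // number of timed launches (illustrative)
	float times[N];                     // per-launch times in ms

	for (int i = 0; i < N; ++i) {
		cudaEventRecord(start, 0);
		SOM_propogate_input_OnDevice <<< 10, SOM_size/10, size >>> (input_d, weights_d, activity_d);
		cudaEventRecord(stop, 0);
		cudaEventSynchronize(stop);     // wait for this launch before reading the time
		cudaEventElapsedTime(&times[i], start, stop);
	}

	// mean and standard deviation of the per-launch times
	float sum = 0.0f, sumSq = 0.0f;
	for (int i = 0; i < N; ++i) {
		sum   += times[i];
		sumSq += times[i] * times[i];
	}
	float mean   = sum / N;
	float stddev = sqrtf(sumSq / N - mean * mean);
	printf("%d launches: mean %f ms, stddev %f ms\n", N, mean, stddev);

You may also want to discard the first launch as a warm-up, and then repeat the same loop for winner_OnDevice so the two kernels get comparable statistics.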

Thanks aviday, I just tried exactly that on your advice and I get something in line with the profiler!

So lesson learned: don’t rely on timing a single call to a kernel.

Cheers