Profiler speeding up my kernels? Nvidia employees please read
Weird timing behavior during profiler

dmacdonald · November 7, 2009, 3:34am

Hi,

I’m trying to get some benchmarking numbers out of a test port of a large Monte Carlo simulation our group has developed. These numbers will directly influence our purchasing decision, so you can imagine my surprise when, running the profiler, I noticed that my timings of kernel calls indicated an extra factor of 2 speed increase over what my program normally achieves when run by itself. I need to know if this is real and why it is happening, and unfortunately I don’t have a lot of time to investigate myself, nor can I check whether the program results are still correct. Can any NVIDIA people explain this as a known possibility and tell me if it represents real potential performance?

I report my numbers on Monday morning, and given where our benchmarking is sitting now, it could determine whether we invest in a Tesla-based cluster or a traditional cluster (!)

Thank you for any help you can provide!

Daniel

Tigga · November 7, 2009, 4:03pm

Doesn’t sound real to me - are you sure the profiler ran through to completion and didn’t hit the profiler time limit? That’s the only case I can think of where the profiler would appear to run faster. I think the default time limit is 30 seconds in the visual profiler. I don’t think there is one with the command line profiler.

Thinking about it, I don’t think I’ve ever seen the output from a profile run giving different results. I suppose it could happen in the case of a race condition which the different runtime settings might expose.

I’m rather surprised you have no way of validating your output. I suppose you could sanity check the FLOPS/memory bandwidth to see if you have gone over the theoretical max, but I imagine this might not be possible in your case.
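The FLOPS/bandwidth sanity check suggested above can be sketched in a few lines of host-side C++. This is only a minimal sketch: the struct and function names are hypothetical, and the byte count, operation count, and device peak figures are assumptions that would have to come from your own kernel analysis and the card's data sheet.

```cpp
#include <cassert>

// Per-kernel figures you estimate yourself (hypothetical struct).
struct KernelStats {
    double bytes_moved;  // global memory bytes read + written by the kernel
    double flop_count;   // floating-point operations performed by the kernel
    double seconds;      // measured kernel time
};

// Theoretical peaks taken from the device data sheet (assumed inputs).
struct DevicePeaks {
    double gbytes_per_s;
    double gflops;
};

// Achieved memory throughput in GB/s.
double achieved_gbytes_per_s(const KernelStats& k) {
    assert(k.seconds > 0.0);
    return k.bytes_moved / k.seconds / 1e9;
}

// Achieved arithmetic throughput in GFLOP/s.
double achieved_gflops(const KernelStats& k) {
    assert(k.seconds > 0.0);
    return k.flop_count / k.seconds / 1e9;
}

// A timing that implies more than the hardware peak cannot be real.
bool timing_is_plausible(const KernelStats& k, const DevicePeaks& p) {
    return achieved_gbytes_per_s(k) <= p.gbytes_per_s &&
           achieved_gflops(k) <= p.gflops;
}
```

If `timing_is_plausible` returns false for the faster of the two timings, that timing implies more throughput than the device can deliver, so the measurement is wrong and not the kernel faster. A true result does not prove the timing is correct; it only fails to rule it out.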

avidday · November 7, 2009, 4:19pm

I would agree with all that. In my experience, code run under the profiler is usually about 15-25% slower than when run normally (note the 30 second default cutoff in the visual profiler). If your timings are real, and I don’t think they are, then the only possible thing I can think of is that the profiler only instruments a single multiprocessor and then scales the results. If your code structure is such that the performance from block to block can vary wildly, it might be that the analysis could be skewed in some way.

BTW, I think you are being pretty optimistic posting a plea for help at what is probably late on Friday evening in the US expecting a reply for a Monday deadline…

Tigga · November 7, 2009, 4:51pm

Sorry - misread that a bit. I had thought that you were measuring whole program execution time, but it seems you are measuring individual kernels. I’m still not quite clear what you mean.

If you’re saying that the profile results are different from your own timings then I’m not sure. I don’t know if the profiler targets one specific multiprocessor when timing kernels or not. I would trust individual times from the profiler more than your own timings.

Or are you saying that your own timings (not from the profiler) around each kernel call report a two-fold increase in speed when run under the profiler? This would be different, and would suggest (again) that your timing logic is flawed.

Either way program execution time seems a superior metric.
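One common way for hand-written timing around a kernel call to go wrong is that a CUDA kernel launch returns control to the host before the kernel has finished, so a host timer stopped right after the launch measures only the launch. The sketch below is a host-only analogy, not CUDA code: `launch_fake_kernel` is a hypothetical stand-in that uses `std::async` and a sleep in place of a real kernel, and `done.wait()` plays the role of a synchronization call such as `cudaThreadSynchronize()` (`cudaDeviceSynchronize()` in later toolkits).

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Milliseconds elapsed since t0.
double ms_since(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

// Hypothetical stand-in for a kernel launch: it returns right away while
// the "kernel" (here just a sleep) keeps running on another thread.
std::future<void> launch_fake_kernel(int work_ms) {
    assert(work_ms >= 0);
    return std::async(std::launch::async, [work_ms] {
        std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
    });
}

struct Timings {
    double launch_only_ms;  // timer stopped right after the launch (flawed)
    double with_sync_ms;    // timer stopped after waiting for completion
};

Timings time_fake_kernel(int work_ms) {
    const Clock::time_point t0 = Clock::now();
    std::future<void> done = launch_fake_kernel(work_ms);
    const double launch_only_ms = ms_since(t0);  // work is still in flight here
    done.wait();                                 // analogous to synchronizing with the GPU
    const double with_sync_ms = ms_since(t0);
    return {launch_only_ms, with_sync_ms};
}
```

Calling `time_fake_kernel(50)` should give a `with_sync_ms` of at least roughly 50 ms and a much smaller `launch_only_ms`. If timings taken around real kernel calls change depending on how the program is run, checking whether the timer brackets a synchronization point is a reasonable first step.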

Smokey · November 8, 2009, 10:18pm

This isn’t exactly a new problem. I reported similar behaviour (much faster execution timings in the visual profiler) not long after the first visual profiler was released…

Still not sure what the cause of the problem is, but needless to say this issue plus the other bugs in the visual profiler (e.g. incorrect and/or missing counters) mean I don’t use this tool anymore.

tmurray · November 9, 2009, 1:30am

If you’re looking at GPU timings only, keep in mind that there are additional sources of driver/OS overhead that are hidden there. (Of course, if that’s a factor of 2 difference, your app is not very well optimized.)

Jimmy_Pettersson · November 9, 2009, 9:52am

I’m quite new to CUDA, but so far I’ve written and tested about 10 different CUDA-based programs, and I can tell you that my timers and the profiler have so far been totally accurate.

What sometimes happens is that I’ve got some piece of code like

system("pause");

which prevents the profiler run from finishing, so it aborts after 30 s.

Btw, I wouldn’t be too quick about making a decision like that. I would simply tell them I need more time, just be honest.
