Profiler speeding up my kernels? Nvidia employees please read
Weird timing behavior during profiler

Hi,

I’m trying to get some benchmarking numbers out of a test port of a large Monte Carlo simulation our group has developed. These numbers will directly influence our purchasing decision, so you can imagine my surprise when, running the profiler, I noticed that my timings of kernel calls indicated an extra factor of 2 speed increase over what my program normally does when run by itself. I need to know if this is real and why it is happening. Unfortunately I don’t have a lot of time to investigate myself, nor can I check if the program results are still correct. Can any NVIDIA people explain this as a known possibility and tell me if it represents real potential performance?

I report my numbers on Monday morning, and given where our benchmarking is sitting now, it could determine whether we invest in a Tesla-based cluster or a traditional cluster (!)

Thank you for any help you can provide!

Daniel

Doesn’t sound real to me - are you sure the profiler ran through to completion and didn’t hit the profiler time limit? That’s the only case that I can think of where the profiler would appear to run faster. I think the default time limit is 30 seconds in the visual profiler. I don’t think there is one with the command line profiler.

Thinking about it I don’t think I’ve ever seen the output from a profile run giving different results. I suppose it could happen in the case of a race condition which the different runtime settings might expose.

I’m rather surprised you have no way of validating your output. I suppose you could sanity check the FLOPS/memory bandwidth to see if you have gone over the theoretical max, but I imagine this might not be possible in your case.
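For what it’s worth, the bandwidth half of that sanity check can be sketched in a few lines of host-side C++. This is only a rough illustration: the function names are made up for this post, the byte counts are whatever you estimate your kernel reads and writes in global memory, and the peak figure has to come from your card’s spec sheet. If the measured number comes out above the peak, the timing (or the amount of work actually done) cannot be right.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: effective global-memory bandwidth in GB/s.
// bytes_read / bytes_written are your own estimate of the kernel's traffic;
// elapsed_ms is the kernel time you measured (with or without the profiler).
double effective_bandwidth_gbs(std::size_t bytes_read,
                               std::size_t bytes_written,
                               double elapsed_ms) {
    assert(elapsed_ms > 0.0);  // a zero or negative time is already a red flag
    const double seconds = elapsed_ms / 1000.0;
    const double bytes = static_cast<double>(bytes_read) +
                         static_cast<double>(bytes_written);
    return bytes / seconds / 1e9;
}

// Hypothetical helper: true when the measurement is physically implausible.
// peak_gbs is an assumption supplied by the caller (vendor spec sheet value).
bool exceeds_peak(double measured_gbs, double peak_gbs) {
    return measured_gbs > peak_gbs;
}
```

Calling `effective_bandwidth_gbs` once with the timing from a normal run and once with the timing from the profiler run, then passing each result to `exceeds_peak`, shows whether the faster number is even possible on the hardware.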

I would agree with all that. In my experience, code run under the profiler is usually about 15-25% slower than run normally (note the 30 second default cutoff in the visual profiler). If your timings are real, and I don’t think they are, then the only possible thing I can think of is that the profiler only instruments a single multiprocessor and then scales the results. If your code structure is such that the performance from block to block can vary wildly, it might be that the analysis could be skewed in some way.

BTW, I think you are being pretty optimistic posting a plea for help at what is probably late on Friday evening in the US expecting a reply for a Monday deadline…

Sorry - misread that a bit. I had thought that you were measuring whole program execution time, however it seems you are measuring individual kernels. I’m still not quite clear what you mean.

If you’re saying that the profile results are different from your own timings then I’m not sure. I don’t know if the profiler targets one specific multiprocessor when timing kernels or not. I would trust individual times from the profiler more than your own timings.

Or are you saying that your own timings (not from the profiler) around each kernel call report a two-fold increase in speed when run under the profiler? This would be different, and would suggest (again) that your timing logic is flawed.

Either way, program execution time seems a superior metric.

This isn’t exactly a new problem, I reported similar behaviour (much faster execution timings in visual profiler) not long after the first visual profiler was released…

Still not sure what the cause of the problem is, but needless to say this issue, plus the other bugs in the visual profiler (e.g. incorrect and/or missing counters), means I don’t use this tool anymore.

If you’re looking at GPU timings only, keep in mind that there are additional sources of driver/OS overhead that are hidden there. (of course if that’s a factor of 2 difference, your app is not very well optimized)

I’m quite new to CUDA, but so far I’ve written and tested about 10 different CUDA-based programs, and I can tell you that my timers and the profiler have so far been totally accurate.

What sometimes happens is that I’ve got some piece of code like

system("pause");

which prevents the profiler from finishing, so it aborts after 30 s.

Btw, I wouldn’t be too quick about making a decision like that. I would simply tell them I need more time, just be honest.
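One way around that, as a rough sketch only, is to make the pause opt-in so that a profiled run never waits for a keypress. The `--pause` flag and the `should_pause` helper below are made up for this example, not anything the profiler or CUDA provides.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns true only if the (made-up) "--pause" flag
// was passed on the command line. argv[0] is the program name, so skip it.
bool should_pause(int argc, const char* const* argv) {
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "--pause") == 0) {
            return true;
        }
    }
    return false;
}
```

At the end of `main` you would then write `if (should_pause(argc, argv)) std::system("pause");`, and simply leave the flag off when launching the program from the profiler.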