Profiler timings vs. real world timings. VERY different...

Hey,

I’m still trying to optimize a real-time application that runs quite a lot of unique kernels in each ‘frame’, and I can never seem to reproduce the performance I see in the profiler in the real application.

For example, I have a list-sorting kernel which takes ~6us GPU time (~17us CPU time) in the profiler - but in my real application it takes 600-700us, which is literally over 100 times slower! Note: the 600us is just launching/executing the kernel; it doesn’t include pushing/popping the CUDA context, setting up parameters, etc.

I’m synchronizing before I even begin setting up the kernel, to make sure there’s no overhead from any previous kernels skewing its performance. I also synchronize before I record the timer start event (i.e. before launching the kernel) and before I record the timer stop event (i.e. after I launch the kernel) - and I get the timings I reported above (real app = 600us just for launching the kernel, profiler = 6us GPU, 17us CPU).
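Roughly, the timing pattern looks like this (a minimal driver API sketch - the kernel handle and grid dimensions are placeholders, not the real code):

[code]
CUevent evStart, evStop;
cuEventCreate(&evStart, CU_EVENT_DEFAULT);
cuEventCreate(&evStop, CU_EVENT_DEFAULT);

cuCtxSynchronize();                     // make sure no previous work is still pending
cuEventRecord(evStart, 0);              // timestamp just before the launch
cuLaunchGrid(kernelFunc, gridW, 1);     // placeholder launch of the kernel under test
cuEventRecord(evStop, 0);               // timestamp just after the launch
cuEventSynchronize(evStop);             // wait for the kernel to actually finish

float elapsedMs = 0.0f;
cuEventElapsedTime(&elapsedMs, evStart, evStop);
[/code]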

I have similar performance differences between the profiler and my real application in all of my kernels (this is the most extreme I’m aware of though).

Any ideas what could cause my real application to be so much slower than the test harness I run the profiler on? (Note: I can’t actually use the profiler on my real-time application - 1) it doesn’t record data properly, and 2) there’s no reliable way to get a recording of all kernel executions.)

Running on Windows XP 32bit, 181.20 drivers - CUDA 2.1.

Edit: I managed to change our demo a bit so it would run in the profiler - and the profiler still gives completely different results from what I recorded with CUDA’s events and my own precision timing code - the profiler reporting nice happy numbers like 20us, while the CUDA events and my own timers both tell me 600us+…

Second edit: Going over some longer performance-timing logs - in one case I have a GPU time of 45us, but a CPU time of 6.709ms… over 140 times slower than when run stand-alone in an isolated test harness, where I get a reliable CPU time of about 80-100us.

I’m still unable to make sense of this. I’m guessing that, due to the frequency at which I call all these kernels, the driver, or the driver API, or ‘something’ is encountering some insane amount of overhead… because I’m still seeing cases where I have a GPU time of ~140us and a CPU time of 6-7ms.

The only pattern I’ve managed to notice is that the overhead increases over time - within 2-3 seconds of these kernels executing each frame, it reaches the point where the CPU time is ~50-140 times the GPU time.

I’ve double- and triple-checked: there’s no memory leaking, I’m not executing an increasing number of kernels, or anything silly like that - the CPU time simply increases dramatically. (This could still very well be related to the bug I raised last year.)

Anyone with experience running dozens of CUDA kernels in ‘real time’ - running each set of kernels at a minimum of 60Hz (I’m currently doing these at 120-250Hz) - I would greatly appreciate any input.

Cheers,

I seem to be having problems with some sort of CPU overhead (http://forums.nvidia.com/index.php?showtopic=94773). I haven’t looked at whether the times increase over the run, though. I’m currently running 280 kernels/second on average (host time).

What function are you using to record events, though? The host CPU could be quite slow if it’s doing a kernel call. I don’t see a mismatch between my timers and the kernel times (Linux gettimeofday).

I use the driver API’s events (and diff the time between the two), in addition to my own timing code (which uses QueryPerformanceCounter/QueryPerformanceFrequency on Windows, and gettimeofday on *nix) - the driver API’s timing results and my own are more or less identical (~2us difference between them).
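The host-side cross-check is roughly this (a sketch only - ‘launchKernel()’ stands in for the actual launch, and evStart/evStop are driver API events as in the snippet above):

[code]
#include <windows.h>

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);

QueryPerformanceCounter(&t0);           // host-side start
cuEventRecord(evStart, 0);
launchKernel();                         // placeholder for the actual kernel launch
cuEventRecord(evStop, 0);
cuEventSynchronize(evStop);             // block until the kernel completes
QueryPerformanceCounter(&t1);           // host-side end

float gpuMs = 0.0f;
cuEventElapsedTime(&gpuMs, evStart, evStop);
double hostMs = 1000.0 * (double)(t1.QuadPart - t0.QuadPart)
                       / (double)freq.QuadPart;
// gpuMs (driver API events) and hostMs (QueryPerformanceCounter) agree
// to within a couple of microseconds here.
[/code]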

My own timings and the driver API timings also agree with the Visual Profiler’s timings (now that I’ve managed to match up the Visual Profiler’s data with the timestamps I logged alongside my own timing data) - so this isn’t an error in timing the kernels; they’re literally just taking a lot of CPU time to complete, for reasons I don’t understand.

I’m running the thread that launches my CUDA kernels at realtime priority; I’m not sure if that somehow disadvantages the driver API (but we need it for the IPP work we also do).
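For reference, the priority setup is along these lines (a sketch - the exact priority values in the real app may differ):

[code]
#include <windows.h>

// Raise the process to the realtime priority class, then make the
// CUDA-launching thread time-critical within it.
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
[/code]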

I’m not really executing any major code in between kernel launches either (just some basic arithmetic and memory allocation/copying - nothing that touches the driver API or the GPU), so I really don’t understand why the CPU time shoots up so excessively. Hence I’m assuming it’s an issue with either the driver API or the driver itself - possibly something to do with scheduling/polling - because there’s absolutely no way the driver or the driver API is doing up to 6ms worth of calculations/allocation/copying after a kernel, so it must simply be ‘waiting’ for no reason… or sleeping (which is even worse).

Hmm, come to think of it, I am telling CUDA to yield - so this could in fact be an OS scheduling issue (I’ll get back with more details) - but again, inside a realtime thread I wouldn’t expect 6ms+ delays waiting on a kernel that only takes 140us.
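For anyone following along, the flag in question is the one passed at context creation (a sketch; device selection and error checking omitted):

[code]
CUdevice dev;
CUcontext ctx;
cuInit(0);
cuDeviceGet(&dev, 0);

// Current setup: the driver yields the CPU while waiting on the GPU,
// which leaves the wait at the mercy of the OS scheduler.
// cuCtxCreate(&ctx, CU_CTX_SCHED_YIELD, dev);

// Lower-latency alternative to test: busy-wait (spin) instead of yielding,
// trading CPU usage for reduced wake-up latency.
cuCtxCreate(&ctx, CU_CTX_SCHED_SPIN, dev);
[/code]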

To clarify, you’re seeing the “CPU Time” in the profiler being much higher than the “GPU Time”? I didn’t… but tell me if you find anything.

Does anyone know exactly where “CPU time” and “GPU time” come from? Are they just different counters in the driver, both measured on the host?

Thanks so much,
Nicholas

Correct - my worst-case example is a kernel which, according to the Visual Profiler, has a “GPU Time” of about 140us and a “CPU Time” of about 6000us (6ms).

My own timing code also verifies the ~6ms CPU Time.

In the previous few frames it had a more-or-less solid “GPU Time” of 100-140us, with the “CPU Time” starting at ~500us and slowly increasing to ~6000us frame by frame…

I have similar patterns with most of my other kernels.

Edit: And to clarify further - at first I thought my own timings and the profiler timings were simply off, but that was because I had only been using the Visual Profiler on my test harness (which runs the kernel once and stops)… I only see this dramatic increase in “CPU Time” when running my kernels constantly (~30-40 kernels, 60-200+ times a second)…

When measuring the run with Linux’s “time” command, I’m curious why there’s such a large proportion of “sys” time (about 3 seconds of sys to 9 seconds of user time); do you see the same?

I also tried running without X to avoid “dual personality” switches of the card (between CUDA mode and graphics mode), but it didn’t seem to help. Tell me if you find anything.

[s]It gets stranger…

I’ve now written a kernel which, when executed in the profiler, takes 17ms - but when executed normally, only takes 5ms… (as opposed to all my other kernels, which are faster in the profiler than in the real app).

I’ve officially lost faith in the profiling tool. I could handle the randomly missing data in various kernels (or worse, some kernels not having ANY data) between various launches - but this (incorrect timings) is just stupid; being given completely incorrect information is not helpful in any way whatsoever.[/s]

Okay, scratch that - I was testing on an older card we had lying around instead of my usual box… it appears to have some issues. Performance on my dev box is roughly consistent with the profiler (the app being slower, of course).

Try removing the ‘cudaThreadSynchronize()’ call after every kernel.
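In other words (a toy sketch - the kernels and launch configuration here are just stand-ins):

[code]
__global__ void kernelA(float *d) { d[threadIdx.x] += 1.0f; }   // dummy kernel
__global__ void kernelB(float *d) { d[threadIdx.x] *= 2.0f; }   // dummy kernel

void frameWithPerLaunchSync(float *d_data)
{
    kernelA<<<1, 256>>>(d_data);
    cudaThreadSynchronize();     // the CPU blocks after every single launch
    kernelB<<<1, 256>>>(d_data);
    cudaThreadSynchronize();
}

void frameBatched(float *d_data)
{
    // Queue the launches back to back and synchronize only once,
    // when the results are actually needed on the host.
    kernelA<<<1, 256>>>(d_data);
    kernelB<<<1, 256>>>(d_data);
    cudaThreadSynchronize();
}
[/code]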