Profiler timings vs. real world timings. VERY different...


I’m still trying to optimize a real time application that uses quite a lot of unique kernels each ‘frame’ of this app - and I can never seem to get the performance I see in the profiler, in my real application.

For example, I have a list sorting kernel - which takes ~6us GPU time (~17us CPU time) in the profiler - but in my real application it takes 600-700us!!! (which is literally over 100 times slower!!!). Note: the 600us is just launching/executing the kernel, it doesn’t include pushing/popping the CUDA context, setting up parameters, etc.

I’m synchronizing before I even begin setting up the kernel, to make sure there’s no overhead from any previous kernels screwing with the performance of the kernel… I also synchronize before I record the timer start event (eg: before launching the kernel), and before I record the timer stop event (eg: after I launch the kernel) - and I get the timings I reported above (real app = 600us just for launching kernel, profiler = 6us GPU, 17us CPU).

I have similar performance differences between the profiler and my real application in all of my kernels (this is the most extreme I’m aware of though).

Any ideas what could cause my real application to be so much slower than in my test harness I run the profiler on? (note: I can’t actually use the profiler on my real time application - 1) it doesn’t record data properly, 2) there’s no reliable way to get a recording of all kernel executions).

Running on Windows XP 32bit, 181.20 drivers - CUDA 2.1.

Edit: I managed to change our demo a bit so it would run in the profiler - and the profiler still gives completely different results relative to what I recorded with CUDA’s events, and my own precision timing code - the profiler saying nice happy things like 20us, the CUDA events and my own timers both telling me 600us+…

Second Edit: Going over some rather longer performance timing logs - in one case I have a GPU time of 45us, but a CPU time of 6.709ms… over 140 times slower than when run stand-alone in an isolated test harness, where I get a reliable CPU time of about 80-100us.

I’m still unable to make sense of thus, I’m guessing due to the frequency in which I call all these kernels - the driver, or driver API, or ‘something’ is encountering some insane amount of overhead… because I’m still seeing cases where I havea GPU time of ~140us, and a CPU time of 6-7ms.

The only pattern I’ve managed to notice is that this does increase over time - within 2-3 seconds of these kernels executing each frame, it starts to reach the limit where CPU time is ~50-140 times that of the GPU time.

I’ve double/triple checked and there’s no memory leaking, I’m not executing increasing amounts of kernels - or anything silly like that - the CPU time simply increases dramatically. (this could still very well be related to the bug I raised last year.)

Anyone with experience running dozens of CUDA kernels in ‘real time’ (running each set of kernels at a minimum of 60Hz (I’m currently doing these at 120-250Hz)) - I would greatly appreciate any input.


I seem to be having problems with some sort of cpu overhead ( I haven’t looked at increasing times through execution though. I’m currently running 280 kernels / second average (host time).

What function are you using to record events though? The host cpu could be quite slow if it’s doing a kernel call. I don’t see a mismatch between my timers and kernel times (Linux gettimeofday).

I use both the driver api’s events (and diff the time between the two), in addition to my own timing code (which uses QueryPerformance* (Counter/Frequency) in windows - gettimeofday for *nix) - the Driver APIs timing results, and my own timing results are more or less identical (~2us difference between the results).

My own timings, and the Driver API timings also agree with the visual profiler’s timings (now I’ve managed to match up the visual profiler’s data, with the timestamps I logged with my own timing data) - so this isn’t an error in timing the performance of the kernels - they’re literally just taking quite a lot of CPU time to complete - for reasons I don’t understand.

I’m running the thread that runs my CUDA kernels with realtime priority, not sure if that somehow disadvantages the Driver API (but we need it for the IPP stuff we also do).

I’m not really executing any major code in between kernel launches either (just some basic arithmetic and memory allocation/copying, nothing that touches the driver api nor the gpu) - so I really don’t understand why the CPU time shoots up excessively… hence I’m assuming it’s an issue with either the Driver API, or the driver itself - possibly something to do with scheduling/polling - because there’s absolutely ‘no way’ the driver, or the driver API would be doing up to 6ms worth of calculations/allocation/copying after a kernel - so it must simply be ‘waiting’ for no reason… or sleeping (which is even worse).

Hmm come to think of it I am telling CUDA to yield - this could in fact be an OS scheduling issue (will get back with more details) - but again, inside of a realtime thread I wouldn’t expect 6ms+ delays waiting for a kernel which only takes 140us.

To clarify, you see the “CPU Time” for the profiler being much higher than the “GPU time”? I didn’t… tell me if you find anything though.

Does anyone know what exactly “CPU time” and “GPU time” come from? Are these just different counters in the driver, both measured on the host?

Thanks so much,

Correct - my worst-case example is a kernel - which according to the Visual Profiler, has a “GPU Time” of about 140us, and a “CPU Time” of about 6000us (6ms).

My own timing code also verifies the ~6ms CPU Time.

The previous few frames, it had a more-or-less solid “GPU Time” of 100-140us, and a “CPU Time” starting at ~500us, then slowly increasing to ~6000us each frame…

I have similar patterns with most of my other kernels.

Edit: And to clarify further, at first I thought my own timings and the profiler timings were off, but it was the fact that I was only using the visual profiler in my test harness (which runs the kernel once, and stops)… and I only see this dramatic increase in “CPU Time” when running my kernels constantly (~30-40 kernels, 60-200+ times a second)…

When measuring the time with Linux’s “time” command, I’m curious why there’s a large proportion “sys” of time (about 3 secs to 9 secs user time); do you see the same?

I also tried running without X running to avoid “dual personality” switches of the card (cuda mode and graphics mode), but it didn’t seem to help. Tell me if you find anything.

[s]It gets stranger…

I’ve now written a kernel, which when executed in the profiler - takes 17ms, but when executed normally - only takes 5ms… (as opposed to all my other kernels, which are faster in the profiler than in a real app)

I’ve officially lost faith in the profiling tool, I could handle the randomly missing data in various kernels (or worse, some kernels not having ANY data) between various launches - but this (incorrect timings)… is just stupid, being given completely incorrect information - is not helpful in any way what so ever.[/s]

Okay scratch that - I was testing on an older card we had laying around instead of my usual box… it appears to have some issues. Performance is roughly identical (app being slower, of course) than the profiler on my dev box.

Try removing the ‘cudaThreadSynchronize()’ call after every kernel.