I’m still trying to optimize a real-time application that runs quite a lot of unique kernels each ‘frame’, and I can never seem to get the performance I see in the profiler in my real application.
For example, I have a list-sorting kernel which takes ~6us GPU time (~17us CPU time) in the profiler - but in my real application it takes 600-700us!!! (which is literally 100 times the GPU time the profiler reports, or more!!!). Note: the 600us covers only launching/executing the kernel; it doesn’t include pushing/popping the CUDA context, setting up parameters, etc.
I’m synchronizing before I even begin setting up the kernel, to make sure no previous kernel is still running and skewing the measurement. I also synchronize right before I record the timer start event (i.e. just before launching the kernel), and again before I record the timer stop event (i.e. after I launch the kernel) - and I still get the timings I reported above (real app = 600us just for launching the kernel, profiler = 6us GPU, 17us CPU).
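A minimal sketch of this kind of measurement, for anyone who wants to compare numbers. It is not my actual code: `dummyKernel` and `timeLaunch` are made-up names, the kernel is a trivial stand-in for the sorting kernel, and it uses the runtime API where a context push/pop setup would normally mean the driver API. It also assumes a newer toolkit for `cudaDeviceSynchronize` and `std::chrono` (on CUDA 2.1 the equivalent call is `cudaThreadSynchronize`). One deliberate difference from the procedure above: the stop event is recorded straight after the launch and then waited on, rather than synchronizing first, so the event interval should not pick up the host-side wait.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the real sorting kernel is not shown here.
__global__ void dummyKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = n - i;
}

// Times a single launch two ways:
//   *gpuMs - time between two CUDA events recorded around the launch
//   *cpuMs - host wall-clock time from just before the launch until it completes
// Returns false if any CUDA call fails.
bool timeLaunch(int *d_data, int n, float *gpuMs, double *cpuMs)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return false;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return false;
    }

    // Drain any earlier work so it cannot leak into this measurement.
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    cudaEventRecord(start, 0);                         // default stream
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);  // asynchronous launch
    cudaError_t launchErr = cudaGetLastError();        // catches bad launch configs
    cudaEventRecord(stop, 0);                          // queued right behind the kernel
    cudaError_t syncErr = cudaEventSynchronize(stop);  // block until the GPU reaches 'stop'
    auto t1 = std::chrono::steady_clock::now();

    bool ok = (launchErr == cudaSuccess) && (syncErr == cudaSuccess) &&
              (cudaEventElapsedTime(gpuMs, start, stop) == cudaSuccess);
    *cpuMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ok;
}
```

To use it, allocate a device buffer with `cudaMalloc`, call `timeLaunch` once and throw that result away (the first launch usually carries one-time setup cost), then call it in a loop and log both numbers. If the event time stays small while the host time balloons, the extra time is probably being spent on the host/driver side rather than in the kernel itself.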
I have similar performance differences between the profiler and my real application in all of my kernels (this is the most extreme I’m aware of though).
Any ideas what could cause my real application to be so much slower than the test harness I run the profiler on? (note: I can’t actually use the profiler on my real-time application - 1) it doesn’t record data properly, 2) there’s no reliable way to get a recording of all kernel executions).
Running on Windows XP 32bit, 181.20 drivers - CUDA 2.1.
Edit: I managed to change our demo a bit so it would run in the profiler - and the profiler still gives completely different results from what I recorded with CUDA’s events and my own precision timing code. The profiler says nice happy things like 20us, while the CUDA events and my own timers both tell me 600us+…
Second Edit: Going over some rather longer performance timing logs - in one case I have a GPU time of 45us, but a CPU time of 6.709ms… that’s nearly 150 times the GPU time, and roughly 65-85 times slower than when run stand-alone in an isolated test harness, where I get a reliable CPU time of about 80-100us.