Hey all,
I have a kernel which I call repeatedly (indefinitely, until the program terminates). For testing purposes, to try to diagnose what's causing this performance problem, I've arranged for this kernel to receive identical inputs on every call, and I've verified that it produces identical outputs each time…
On average, this kernel takes ~4ms to execute. However, after an arbitrary number of executions (it's happened after 2-3, sometimes after 30-1000+, sometimes never), its execution time jumps to 9-12ms (again, with identical inputs/outputs and no errors) and consistently stays at that 2-3x slower speed for the rest of the process's lifespan.
To reiterate: the code paths taken in both the kernel and the CPU (Driver API) code are identical, and they receive/output identical values each time I call this function - all running in the same CPU thread and on the same CUDA device.
I should note that each time I call this kernel, I push a context (which I create once at the beginning of the program), allocate new device memory, transfer the data host->device, synchronize the async memcpys, start an event timer, launch the kernel, synchronize, stop the event timer, then free the memory and pop the context. Note: the timings I'm referring to are measured between the start and stop events.
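To make the per-call sequence concrete, here's a minimal sketch of what each invocation looks like with the CUDA 2.0-era Driver API. All names (`runKernel`, `hKernel`, the grid size, and the single pointer parameter) are illustrative assumptions, not my actual code:

```c
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

static void check(CUresult r, const char *what) {
    if (r != CUDA_SUCCESS) { fprintf(stderr, "%s failed: %d\n", what, r); exit(1); }
}

/* One call of the workflow described above; returns the event-timed
 * kernel duration in milliseconds. hKernel/grid dims are placeholders. */
float runKernel(CUcontext ctx, CUfunction hKernel,
                const void *hostIn, size_t bytes)
{
    check(cuCtxPushCurrent(ctx), "cuCtxPushCurrent");

    /* fresh allocation + async host->device copy each call */
    CUdeviceptr dIn;
    check(cuMemAlloc(&dIn, (unsigned int)bytes), "cuMemAlloc");
    check(cuMemcpyHtoDAsync(dIn, hostIn, (unsigned int)bytes, 0), "cuMemcpyHtoDAsync");
    check(cuCtxSynchronize(), "sync after memcpy");

    CUevent start, stop;
    check(cuEventCreate(&start, CU_EVENT_DEFAULT), "cuEventCreate start");
    check(cuEventCreate(&stop,  CU_EVENT_DEFAULT), "cuEventCreate stop");

    /* start timer, launch, synchronize, stop timer - as in the post */
    check(cuEventRecord(start, 0), "cuEventRecord start");
    check(cuParamSeti(hKernel, 0, (unsigned int)dIn), "cuParamSeti");      /* 32-bit ptr */
    check(cuParamSetSize(hKernel, sizeof(unsigned int)), "cuParamSetSize");
    check(cuLaunchGrid(hKernel, 64, 1), "cuLaunchGrid");                   /* grid is a guess */
    check(cuCtxSynchronize(), "sync after kernel");
    check(cuEventRecord(stop, 0), "cuEventRecord stop");
    check(cuEventSynchronize(stop), "cuEventSynchronize");

    float ms = 0.0f;
    check(cuEventElapsedTime(&ms, start, stop), "cuEventElapsedTime");

    /* clean up and pop the context, as described */
    cuEventDestroy(start);
    cuEventDestroy(stop);
    cuMemFree(dIn);
    CUcontext popped;
    cuCtxPopCurrent(&popped);
    return ms;  /* this is the number that jumps from ~4ms to 9-12ms */
}
```

The `ms` value returned here corresponds to the timing quoted above; everything outside the start/stop events (allocation, memcpy, context push/pop) is deliberately excluded from it.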
My question is: what could be causing these unpredictable performance degradations? And more importantly, what can I do to avoid them entirely?
Cheers,
CUDA Version: 2.0
Operating System: Windows Vista (32bit)
Card: Quadro FX 570
Driver Version: 177.84
P.S. I'm also calling other kernels both before and after this kernel - however, since I synchronize both before and after this kernel, I can't see how those other kernels could affect its performance?