I have a CUDA kernel that runs on a large dataset and takes a significant amount of time. The data is not needed immediately, so I created a separate OS thread with a new CUDA context to have the kernel execute in the background. This all works fine. However, the display performance is a little slower than I would like. Is there a way I can control the amount of GPU resources a kernel uses so I can balance the screen FPS with the background CUDA computation?
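In case it helps, the structure is roughly the following sketch (the kernel body, names, and launch configuration are placeholders for the real code, which also creates a separate driver-API context rather than using the runtime API):

```cpp
// Sketch only: a worker thread launches a long-running kernel while the
// main thread keeps rendering. Kernel name and sizes are placeholders.
#include <cuda_runtime.h>
#include <thread>

__global__ void distanceKernel(float* volume, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        volume[i] = 0.0f;   // placeholder for the real distance computation
}

void backgroundWork(float* d_volume, int n)
{
    // Launched from a worker thread so the render loop on the main
    // thread keeps running while this kernel executes.
    distanceKernel<<<(n + 255) / 256, 256>>>(d_volume, n);
    cudaDeviceSynchronize();
}

int main()
{
    const int n = 1 << 24;
    float* d_volume = nullptr;
    cudaMalloc(&d_volume, n * sizeof(float));

    std::thread worker(backgroundWork, d_volume, n);
    // ... main thread continues rendering here ...
    worker.join();

    cudaFree(d_volume);
    return 0;
}
```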
CUDA kernels and the display driver time-slice the device; they don’t share it concurrently. The only way to improve display responsiveness in that kind of situation is to reduce kernel execution time, either by reducing the work per kernel call or by making the kernel run faster.
Maybe I need to explain better. I have a huge distance volume that I’m generating. It’s a single kernel call that takes about 30 seconds to complete. During that time the display performance drops from ~200 fps to ~40 fps. If the balancing is done with time slicing, is there any way to control how the slicing happens? There are various things the user needs to do during this time, hence I’m running this large kernel in a separate context in a separate thread.
When a kernel is running on a GPU it has total control of that GPU: the display manager is effectively locked out and the display is frozen for the duration of the kernel. A single kernel on a GPU that is also driving a display can never execute for more than about 5 seconds, otherwise a display driver watchdog will kill the running kernel. There is no load balancing of any sort. Either the running kernel finishes and yields the GPU inside the 5 second watchdog window, or the driver kills it. This is, to the best of my knowledge, the same on every platform.
I don’t see how you can be running a single kernel for 30 seconds on a shared GPU with an application simultaneously rendering. Are you sure all of this is really happening on a single GPU?
A remarkable response… what I have described works… the timings are such that there can be no mistake. I have only one card. I launch a new thread, create a CUDA context for it, and start the kernel. The display in the main thread continues to refresh.
What you described is not possible for a single kernel launch. Are you actually launching a single kernel many times? How would you be avoiding the timeout otherwise?
OK… forgive me… just went back to the original code… I forgot I actually did subdivide that processing into multiple kernel calls… without even knowing about the time limit…
That means I can achieve what I want by just subdividing the work into smaller pieces and having the CPU thread that controls them run at a lower priority. OK, great.
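For the record, what I have in mind is roughly this kind of chunked launch loop, with the chunk size tuned so each launch stays well under the watchdog limit (kernel name and sizes are placeholders, not my actual code):

```cpp
// Sketch of the chunked approach: the same total work is issued as many
// short kernel launches, so the display driver gets the GPU back
// between chunks.
#include <cuda_runtime.h>
#include <algorithm>

__global__ void distanceChunkKernel(float* volume, int offset, int count)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + count)
        volume[i] = 0.0f;   // placeholder for the real distance computation
}

void runInChunks(float* d_volume, int n, int chunkSize)
{
    for (int offset = 0; offset < n; offset += chunkSize)
    {
        int count  = std::min(chunkSize, n - offset);
        int blocks = (count + 255) / 256;
        distanceChunkKernel<<<blocks, 256>>>(d_volume, offset, count);
        // Synchronizing after each chunk yields the GPU so the display
        // can refresh before the next chunk is launched.
        cudaDeviceSynchronize();
    }
}
```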
Now out of curiosity: how does this sort of thing scale with kernel setup overhead / context switching? Say I have a kernel that takes 5 seconds to run in one shot and I divide it into 1000 steps, will it take significantly longer… (assuming that I can still give it a full grid of threads in each substep)?
It seems to be very platform specific, but at least on Linux the total kernel launch overhead seems to be down in the 100 microsecond range, which is probably negligible in this context. If you have host-device data transfers along with the kernel launches, you might find yourself at the mercy of PCI-e bus latency with very small time slices and transfers of small amounts of data. But if the individual kernel run times are on the order of tens of milliseconds, you shouldn’t see too much effect. Your 5 second case might wind up taking 5.5 seconds.
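If you want a number for your own platform, a rough way to estimate per-launch overhead is to time many back-to-back launches of an empty kernel, along these lines (just an illustrative sketch):

```cpp
// Estimate per-launch overhead: time many launches of an empty kernel
// and divide the elapsed time by the number of launches.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void emptyKernel() {}

int main()
{
    const int launches = 1000;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm up so context creation doesn't skew the measurement.
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average launch overhead: %.1f microseconds\n",
           ms * 1000.0f / launches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```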