Load balancing CUDA contexts

I have a CUDA kernel that runs on a large dataset and takes a significant amount of time. The data is not needed immediately, so I created a separate OS thread with a new CUDA context to have the kernel execute in the background. This all works fine. However, the display performance is a little slower than I would like. Is there a way I can control the amount of GPU resources a kernel uses so I can balance the screen fps with the background CUDA computation?
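Roughly, the setup is along these lines (simplified here to the runtime API and an std::thread; my real code gives the worker thread its own driver-level context, and the kernel below is just a placeholder, not the actual distance computation):

```cpp
#include <cuda_runtime.h>
#include <thread>

// Placeholder standing in for the long-running computation over the volume.
__global__ void generateDistanceVolume(float* volume, size_t n)
{
    // Grid-stride loop so one launch covers the whole volume.
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
    {
        volume[i] = 0.0f;  // real distance-field math would go here
    }
}

void startBackgroundWork(float* d_volume, size_t n)
{
    // Worker thread drives the kernel; with the driver API you would call
    // cuCtxCreate here to give this thread its own context instead.
    std::thread worker([=] {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        generateDistanceVolume<<<1024, 256, 0, stream>>>(d_volume, n);

        cudaStreamSynchronize(stream);  // blocks only this worker thread
        cudaStreamDestroy(stream);
    });
    worker.detach();  // main thread keeps rendering while the kernel runs
}
```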

Thanks,

Vincent

CUDA kernels and the display driver time-slice the device; they don't share it. The only way to improve display responsiveness in that kind of situation is to reduce kernel execution time, either by reducing the work per kernel call or by making the kernel run faster.

Maybe I need to explain better. I have a huge distance volume that I'm generating. It's a single kernel call that takes about 30 seconds to complete. During that time the display performance drops from ~200 fps to about ~40 fps. If the balancing is done with time slicing, is there any way to control how the slicing happens? There are various things the user needs to do during this time, hence I'm running this large kernel in a separate context in a separate thread.

If it’s a single kernel call, you shouldn’t be getting any display updates while that kernel is running…

I am not sure I follow.

When a kernel is running on a GPU it has total control of that GPU and the display manager cannot refresh the display. During the kernel execution, the display manager is effectively locked out of the GPU and the display is frozen. A single kernel on a shared display can never execute for more than 5 seconds; otherwise a display driver watchdog will kill the running kernel. There is no load balancing of any sort. Either the running kernel finishes and yields the GPU inside the 5 second watchdog window, or the driver kills it. This is, to the best of my knowledge, the same on every platform.

I don’t see how you can be running a single kernel for 30 seconds on a shared GPU with an application simultaneously rendering. Are you sure all of this is really happening on a single GPU?

A remarkable response… what I have described works… the timings are such that there can be no mistake. I have only one card. I launch a new thread, create a CUDA context for it, and start the kernel. The display in the main thread continues to refresh.

What you described is not possible for a single kernel launch. Are you actually launching a single kernel many times? How would you be avoiding the timeout otherwise?

OK… forgive me… just went back to the original code… I forgot I actually did subdivide that processing into multiple kernel calls… without even knowing about the time limit…

That means I can achieve what I want just by subdividing the work into smaller pieces and having the CPU thread that controls them run at a lower priority. OK, great.
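Something like this rough sketch, I think (processChunk and chunkSize are just placeholders; lowering the controlling CPU thread's priority is OS-specific and not shown here):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel that processes one slice of the volume, selected by offset.
__global__ void processChunk(float* volume, size_t offset, size_t count)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < count)
        volume[offset + i] = 0.0f;  // per-element work goes here
}

void runInChunks(float* d_volume, size_t total, size_t chunkSize)
{
    const int block = 256;
    for (size_t offset = 0; offset < total; offset += chunkSize)
    {
        size_t count = (total - offset < chunkSize) ? (total - offset) : chunkSize;
        int grid = (int)((count + block - 1) / block);

        processChunk<<<grid, block>>>(d_volume, offset, count);

        // Wait for this slice to finish before launching the next, so each
        // launch stays well under the watchdog limit and the display driver
        // gets the GPU back between slices.
        cudaDeviceSynchronize();
    }
}
```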

Now, out of curiosity: how does this sort of thing scale with kernel setup overhead / context switching? Say I have a kernel that takes 5 seconds to run in one shot and I divide it into 1000 steps. Will it take significantly longer? (Assuming that I can still give it a full grid of threads in each substep.)

Thanks for your help

It may take longer. Additionally, having multiple contexts per GPU is never optimal as there is a switching penalty there.

Which is what I suggested in my original reply.

It seems to be very platform specific, but at least on Linux the total kernel launch overhead seems to be down in the 100 microsecond range, which is probably negligible in this context. If you have host-device data transfers along with the kernel launches, then you might find yourself at the mercy of PCI-e bus latency with very small time slices and small amounts of data per transfer. But if the individual kernel run times are on the order of tens of milliseconds, then you shouldn’t see too much effect. Your 5 second case might wind up taking 5.5 seconds.
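If you want to check that figure on your own setup, a quick sketch like this (the launch count and names are arbitrary) times the launch-plus-synchronize round trip for an empty kernel:

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void emptyKernel() {}

int main()
{
    const int launches = 10000;

    // Warm-up launch so context/module initialization is not counted.
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < launches; ++i)
    {
        emptyKernel<<<1, 1>>>();
        cudaDeviceSynchronize();  // count the full launch + completion round trip
    }
    auto stop = std::chrono::high_resolution_clock::now();

    double usPerLaunch =
        std::chrono::duration<double, std::micro>(stop - start).count() / launches;
    printf("Average launch + sync cost: %.1f us per launch\n", usPerLaunch);
    return 0;
}
```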