Load Balancing Streams

Ok, so here’s the problem. Suppose I have two streams: one is compute bound, the other memory bound. They can run concurrently (they’re stages in a pipeline, if anyone’s curious), and in fact, since one is memory bound, running them on the device at the same time should increase throughput, because the compute-bound one will relieve pressure on the memory system.

The question is how to get them to run concurrently, ideally in a carefully profiled ratio with respect to active blocks from each. If I naively execute kernels from both streams, the balance will swing around more or less at random, with one or the other having more active blocks at any given time. This will defeat the purpose of trying to load balance them in the first place.

Is there a good way to do this, or do I have to rely on careful grid sizing and bookkeeping to track the ratio at any given moment?

Design the compute-bound function and the memory-bound function into the same kernel code.

Otherwise, getting the GPU scheduler to intermix blocks for execution from 2 different kernel launches is difficult.
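A minimal sketch of that fused-kernel approach, using block-index specialization so the ratio of compute-bound to memory-bound blocks is fixed by the launch configuration rather than left to the scheduler. `computeWork`, `memoryWork`, and the buffer names are placeholders for your actual pipeline stages:

```cuda
#include <cuda_runtime.h>

// Placeholder for the ALU-heavy pipeline stage.
__device__ void computeWork(float *buf, int i)
{
    float x = buf[i];
    for (int k = 0; k < 1024; ++k)
        x = x * 1.0001f + 0.0001f;   // arithmetic-bound inner loop
    buf[i] = x;
}

// Placeholder for the bandwidth-heavy pipeline stage.
__device__ void memoryWork(float *dst, const float *src, int i)
{
    dst[i] = src[i];                 // streaming copy
}

// One launch; the first `computeBlocks` blocks run the compute-bound
// stage and the remaining blocks run the memory-bound stage.
__global__ void fusedKernel(float *compBuf, float *dst, const float *src,
                            int n, int computeBlocks)
{
    if (blockIdx.x < computeBlocks) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) computeWork(compBuf, i);
    } else {
        int i = (blockIdx.x - computeBlocks) * blockDim.x + threadIdx.x;
        if (i < n) memoryWork(dst, src, i);
    }
}
```

You’d profile to find the best value of `computeBlocks` relative to the total grid size; the tradeoff is that both stages are stuck with the same block dimensions.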

Some other mechanisms you might consider:

  • clever use of stream priority (e.g. run the memory-bound work as a low-priority “background” task, and carefully mix in the compute-bound work as a high-priority foreground task). Your block execution duration will need to be short for this to have a chance of doing anything useful.

  • GPU work distribution/specialization, using persistent threads - let the GPU issue work within the threadblock at a rate that is commensurate with the ratio you are trying to achieve. There might even be a clever way to use CDP (CUDA Dynamic Parallelism) instead.
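For the stream-priority option, the setup looks roughly like this (the kernel names and grid dimensions are placeholders; note that in CUDA a numerically lower priority value means higher scheduling priority):

```cuda
#include <cuda_runtime.h>

void launchPrioritized()
{
    // Query the valid priority range for this device.
    int leastPrio, greatestPrio;
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    cudaStream_t bgStream, fgStream;
    // Memory-bound "background" work at lowest priority...
    cudaStreamCreateWithPriority(&bgStream, cudaStreamNonBlocking, leastPrio);
    // ...compute-bound "foreground" work at highest priority.
    cudaStreamCreateWithPriority(&fgStream, cudaStreamNonBlocking, greatestPrio);

    // Blocks from fgStream are preferentially scheduled as resident
    // blocks retire, biasing the mix toward the compute-bound kernel.
    memoryBoundKernel<<<gridM, blockM, 0, bgStream>>>(/* ... */);
    computeBoundKernel<<<gridC, blockC, 0, fgStream>>>(/* ... */);
}
```

This only influences which blocks get scheduled next as resident blocks finish, which is why short block durations matter: long-running blocks give the scheduler few opportunities to rebalance.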

I don’t think any of this is easy, or is likely to give you what I think you are asking for - an easy, “automatic” method to issue new work based on current loading.

Try using a persistent launch grid, where you perform a deterministic assignment of blocks to either the memory-bound or the compute-bound tasks.

Feeding new work packages into an already running grid and getting results out reliably could be a little tricky.
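One way to sketch the persistent-grid idea: launch a grid sized to just fill the device (e.g. using the occupancy API), and have each resident block loop, pulling work tiles off a global counter instead of exiting after one tile. The compute/memory ratio is then exactly the block partition you chose at launch. Everything here is illustrative - the tile-copy stands in for the memory-bound stage, and a compute-bound twin would look the same with different inner work:

```cuda
#include <cuda_runtime.h>

__device__ unsigned int workHead = 0;   // next tile index to hand out

// Persistent memory-bound stage: each resident block loops, grabbing
// tiles from the global counter until the queue is drained.
__global__ void persistentCopyKernel(float *dst, const float *src,
                                     unsigned int numTiles, int tileSize)
{
    __shared__ unsigned int tile;
    for (;;) {
        if (threadIdx.x == 0)
            tile = atomicAdd(&workHead, 1u);    // grab the next tile
        __syncthreads();
        if (tile >= numTiles)
            break;                              // queue drained; retire block
        int base = tile * tileSize;
        for (int i = threadIdx.x; i < tileSize; i += blockDim.x)
            dst[base + i] = src[base + i];      // bandwidth-heavy placeholder
        __syncthreads();   // everyone done with `tile` before the next fetch
    }
}
```

The grid would be sized with something like `cudaOccupancyMaxActiveBlocksPerMultiprocessor` times the SM count, split between the two kernels in your profiled ratio, so that every launched block is actually resident and the ratio of active blocks stays fixed.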


This is unfortunate. Since the two kernels have quite different block-size requirements, the uberkernel ideas are probably out, although the idea of two carefully sized grids of persistent threads could be made to work.

What’s the current state of CUDA vs. the watchdog timer? Historically, long-running grids would block the GPU and leave the system unresponsive. Has any of this changed? Is there a way for a kernel to check how long it’s been running, and thus stop accepting new work at some point in order to terminate and give the GPU a chance to schedule external stuff, i.e. from the OS or other programs?
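On the self-timing part of the question: a kernel can read the `%globaltimer` PTX special register (a nanosecond-resolution global timer) and stop pulling new work once a time budget is exhausted, after which the host relaunches it to resume. A hedged sketch, with the per-item work as a placeholder:

```cuda
#include <cuda_runtime.h>

// Read the PTX global nanosecond timer.
__device__ unsigned long long globalTimerNs()
{
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

__device__ unsigned int nextItem = 0;   // persists across relaunches

__global__ void timeBoxedKernel(float *data, unsigned int numItems,
                                unsigned long long budgetNs)
{
    unsigned long long start = globalTimerNs();
    for (;;) {
        unsigned int i = atomicAdd(&nextItem, 1u);
        if (i >= numItems)
            break;                       // no work left
        data[i] *= 2.0f;                 // placeholder work item
        if (globalTimerNs() - start > budgetNs)
            break;   // budget spent: retire voluntarily so the display/OS
                     // gets the GPU; the host relaunches to continue
    }
}
```

Since `nextItem` lives in device global memory, a relaunch picks up where the previous launch left off, turning one long-running grid into a series of short ones that stay under the watchdog limit.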