Ok, so here’s the problem. Suppose I have two streams: one is compute bound, the other memory bound. They can run concurrently (they’re stages in a pipeline, if anyone’s curious), and since one is memory bound, running them on the device at the same time should increase throughput, because the compute-bound kernel will relieve pressure on the memory system.
The question is how to get them to run concurrently, ideally in a carefully profiled ratio of active blocks from each. If I naively launch kernels from both streams, the balance swings around more or less at random, with one or the other holding more active blocks at any given moment. That defeats the purpose of trying to load balance them in the first place.
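To be concrete, here’s roughly what I mean by the naive version (kernel names like `computeStage`/`memoryStage` are just placeholders): both kernels get full-sized grids on separate streams, and nothing constrains how many resident blocks each one gets at any instant.

```cuda
#include <cuda_runtime.h>

// Placeholder stand-ins for the two pipeline stages.
__global__ void computeStage(float *out, const float *in, int n) { /* ... */ }
__global__ void memoryStage(float *out, const float *in, int n) { /* ... */ }

// Naive version: launch both at full grid size on separate streams and let
// the hardware scheduler interleave them. The mix of resident blocks from
// each kernel is entirely up to the scheduler.
void launchNaive(float *a, float *b, int n)
{
    cudaStream_t sCompute, sMemory;
    cudaStreamCreate(&sCompute);
    cudaStreamCreate(&sMemory);

    const int blockSize = 256;
    const int grid = (n + blockSize - 1) / blockSize;

    computeStage<<<grid, blockSize, 0, sCompute>>>(b, a, n);
    memoryStage <<<grid, blockSize, 0, sMemory >>>(a, b, n);

    cudaStreamSynchronize(sCompute);
    cudaStreamSynchronize(sMemory);
    cudaStreamDestroy(sCompute);
    cudaStreamDestroy(sMemory);
}
```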
Is there a good way to do this, or do I have to rely on careful grid sizing and bookkeeping to track the ratio at any given moment?
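For reference, the grid-sizing approach I mentioned would look something like this sketch (again with placeholder kernel names, and a hypothetical `targetComputeFrac` parameter from profiling): size each grid to a fixed share of the device’s resident-block capacity using the occupancy API, so neither kernel can grab more than its allotment. The kernels would need grid-stride loops internally to cover the full problem with a smaller grid.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// Placeholder stand-ins for the two pipeline stages; each would use a
// grid-stride loop so a fixed-size grid still covers all n elements.
__global__ void computeStage(float *out, const float *in, int n) { /* ... */ }
__global__ void memoryStage(float *out, const float *in, int n) { /* ... */ }

// Sketch: pin the active-block ratio by giving each kernel a fixed slice
// of the device. targetComputeFrac is the profiled fraction of resident
// blocks to hand to computeStage.
void launchBalanced(float *a, float *b, int n, float targetComputeFrac,
                    cudaStream_t sCompute, cudaStream_t sMemory)
{
    int device = 0, numSMs = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

    const int blockSize = 256;
    int perSMCompute = 0, perSMMemory = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&perSMCompute, computeStage,
                                                  blockSize, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&perSMMemory, memoryStage,
                                                  blockSize, 0);

    // Split the device's resident-block budget in the profiled ratio.
    int computeBlocks = std::max(1,
        (int)(numSMs * perSMCompute * targetComputeFrac));
    int memoryBlocks = std::max(1,
        (int)(numSMs * perSMMemory * (1.0f - targetComputeFrac)));

    computeStage<<<computeBlocks, blockSize, 0, sCompute>>>(b, a, n);
    memoryStage <<<memoryBlocks,  blockSize, 0, sMemory >>>(a, b, n);
}
```

The catch, and the reason I’m asking, is that this only holds the ratio while both kernels are resident; as soon as one finishes, its slice frees up, so some bookkeeping across launches still seems unavoidable.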