Fixing SMs for a kernel

Is it possible to fix number of SMs to be given to a particular kernel?

Yes: CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops

Generally speaking, it’s not possible. You don’t have control over the scheduling of threadblocks to SMs, nor is there any method to restrict SMs to be used by a particular kernel.

With some “extraordinary” programming techniques, it is possible to cause a kernel to only “occupy” certain SMs, but this is well outside the scope of typical CUDA programming.

I assume what allanmac is referring to would be scoping your kernel so as to launch only a certain number of blocks. This would have a side effect that you would only “occupy” that many SMs. However:

  1. You wouldn’t get to pick or control the SMs
  2. You would be launching at most 1 block per SM, which is generally not a high-performance programming technique (see the sketch after this list).
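Something like the following is what that scoped launch might look like. This is only an illustrative sketch (kernel name, sizes and block count are made up); it shows that you can cap the number of blocks, but not which SMs they land on.

```cuda
__global__ void work(float *x, int n)
{
    // grid-stride loop so a deliberately small grid still covers all n elements
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] += 1.0f;
}

int main()
{
    int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));

    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);

    // launch fewer blocks than there are SMs, hoping to "occupy" only that many;
    // the hardware scheduler still decides which SMs actually get them
    int numBlocks = numSMs > 4 ? 4 : numSMs;
    work<<<numBlocks, 256>>>(x, n);
    cudaDeviceSynchronize();

    cudaFree(x);
    return 0;
}
```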


What @txbob said!

I like using uber-blocks (~1024 threads and all the registers and smem) but only when I have a workload that benefits from coordinating so many warps. An uber-block (vs. a single-warp micro-block) typically makes sense if you’re really really trying to squelch GMEM traffic.
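For what it's worth, here is roughly what I mean, as a made-up sketch (names and tile size are illustrative, not from any real code): one 1024-thread block stages a tile in shared memory once, and all 32 warps reuse it, trading occupancy for reduced GMEM traffic.

```cuda
#define TILE 4096   // floats staged per block (16 KB of shared memory), illustrative

__global__ void uberBlockKernel(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];

    // each block processes one TILE-sized chunk
    int base = blockIdx.x * TILE;

    // cooperative load: all 1024 threads fill the tile
    for (int i = threadIdx.x; i < TILE && base + i < n; i += blockDim.x)
        tile[i] = in[base + i];
    __syncthreads();

    // all warps now work out of shared memory instead of re-reading GMEM
    for (int i = threadIdx.x; i < TILE && base + i < n; i += blockDim.x)
        out[base + i] = tile[i] * tile[i];
}

// launch: uberBlockKernel<<<(n + TILE - 1) / TILE, 1024>>>(in, out, n);
```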

Wait, are you saying that grid-stride loops are inefficient? Or are you referring to the specific configuration of grid stride looping implied by the OP?

Pretty sure @txbob is saying that a grid might be both resident on the GPU and stationary on a known number of multiprocessors if and only if the GPU was idle before launch and the grid blocks are larger than a few warps (most likely multiprocessor-spanning).

Some experimentation could verify how CUDA GPUs stripe blocks across available multiprocessors but this could never be relied upon.
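One way to run that experiment (for curiosity only, never something to rely on) is to read the %smid special register from each block and print which SM it ran on:

```cuda
#include <cstdio>

__device__ unsigned int smid()
{
    unsigned int id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));   // special register holding the SM id
    return id;
}

__global__ void whereAmI()
{
    if (threadIdx.x == 0)
        printf("block %u ran on SM %u\n", blockIdx.x, smid());
}

int main()
{
    whereAmI<<<16, 128>>>();   // launch a few blocks and observe the striping
    cudaDeviceSynchronize();
    return 0;
}
```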

The O.P.'s question is sort of asking if “device fission” is supported. It’s not (yet!). That’s a feature that’s typically only supported by OpenCL runtimes.

Yes, I am pretty much interpreting this as a device fission question. I may be misinterpreting it.

A strategy (however it comes about) that only launches one block per SM may not be taking full advantage of the machine, i.e. exposing enough parallelism.

The two most important priorities for a GPU programmer are to effectively expose “enough” parallelism to saturate the machine, and to make effective use of the memory subsystem(s).

One measure of exposed parallelism is the number of active threads or warps that are resident on an SM. For many GPUs this has a maximum limit of 2048 threads (or 64 warps) and you cannot achieve this with a single threadblock.
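You can query those limits for whatever GPU you have (the 2048-thread / 64-warp figure is common but not universal), e.g. with something like:

```cuda
#include <cstdio>

int main()
{
    int threadsPerSM = 0, threadsPerBlock = 0;
    cudaDeviceGetAttribute(&threadsPerSM,
                           cudaDevAttrMaxThreadsPerMultiProcessor, 0);
    cudaDeviceGetAttribute(&threadsPerBlock,
                           cudaDevAttrMaxThreadsPerBlock, 0);
    printf("max threads per SM:    %d (%d warps)\n", threadsPerSM, threadsPerSM / 32);
    printf("max threads per block: %d\n", threadsPerBlock);
    // on most GPUs threadsPerBlock < threadsPerSM, so a single block
    // cannot fully populate an SM by itself
    return 0;
}
```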

No I am not saying grid stride loops are inefficient. The previous concept I was discussing (optimization/exposed parallelism) really has nothing to do with grid-stride loops. I can write a grid stride loop that uses 100,000 threads (probably exposing enough parallelism), and I can write a grid stride loop of exactly 1 thread (definitely not enough parallelism). These concepts are approximately orthogonal.
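To make that concrete, here is a plain grid-stride loop (illustrative kernel only); the kernel is identical in both cases, and only the launch configuration decides how much parallelism is actually exposed:

```cuda
__global__ void scale(float *x, float a, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= a;
}

// plenty of exposed parallelism:
//   scale<<<400, 256>>>(x, 2.0f, n);   // ~100,000 threads
// technically correct, but starves the GPU:
//   scale<<<1, 1>>>(x, 2.0f, n);       // a single thread walks all n elements
```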

Good points there! I’d like to add one significant thing: in many applications it is (and will increasingly be) a serious issue that not all tasks have enough parallelism for a GPU with 20-30 SMs. However, even if the CPU were perfectly capable of doing the task, in order to optimize data locality these small kernels need to be executed, inefficiently, on the GPU. This causes a number of issues, ranging from delaying the critical path, to preempting other kernels, to simply executing at a very low parallel efficiency compared to running on, say, just a few SMs.

Therefore, I would very much like to see at least some simple ways to do “device fission” (e.g. assign streams to a set of SMs for exclusive or conditionally exclusive scheduling). While engineers from NVIDIA have previously acknowledged these issues, I have not received much feedback on whether partitioning SMs is something that’s even being considered.

I’m sure there are improvements that can address various use cases. I think using today’s technology the idea would be to try to address the use case with (a rough sketch follows the list):

  1. concurrent kernels
  2. streams
  3. stream priorities
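As a rough sketch of that combination (kernel names, sizes and workloads are made up for illustration): put the bulk work in a low-priority stream and the small latency-critical work in a high-priority stream, so the high-priority blocks are scheduled ahead of the long-running kernel’s blocks as SMs free up.

```cuda
#include <cuda_runtime.h>

__global__ void longKernel(float *a, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        for (int k = 0; k < 100; ++k)          // artificially long-running block
            a[i] = sqrtf(a[i] + 1.0f);
}

__global__ void shortKernel(float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] += 1.0f;                   // small latency-critical work
}

int main()
{
    int leastPri, greatestPri;   // note: numerically lower value = higher priority
    cudaDeviceGetStreamPriorityRange(&leastPri, &greatestPri);

    cudaStream_t lowPri, highPri;
    cudaStreamCreateWithPriority(&lowPri,  cudaStreamNonBlocking, leastPri);
    cudaStreamCreateWithPriority(&highPri, cudaStreamNonBlocking, greatestPri);

    int n = 1 << 20, m = 1 << 14;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, m * sizeof(float));

    longKernel <<<1024, 256, 0, lowPri >>>(a, n);              // bulk work, low priority
    shortKernel<<<(m + 127) / 128, 128, 0, highPri>>>(b, m);   // critical path, high priority

    cudaDeviceSynchronize();
    cudaStreamDestroy(lowPri);
    cudaStreamDestroy(highPri);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```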

Well, I’m glad to get some clarity! You scared me for a second, ha ha. But I wouldn’t say they’re 100% orthogonal concepts. If anything, that dot product is at least half the product of the magnitudes of the respective vectors :P

@txbob: I see no way to address the issue I raised without being able to control the scheduling and “width” of concurrent kernels. Additionally, it turns out that priorities can be a nightmare when optimizing for the critical path (e.g. a sequence of short kernels in a high-priority stream keeps “losing” the GPU to a long-running kernel in the low-priority stream). Am I missing something?


I don’t think you’re missing anything. I mentioned already that “I’m sure there are improvements that can address various use cases.” Today’s technology does not address all possible scenarios.

Without other descriptions or considerations, I think the general approach to handling small kernels would be to suggest the use of concurrency.

For the case of long running kernels intermixed with higher priority kernels, stream priorities (today) only impact scheduling priority at the threadblock launch level. For a “low priority” kernel whose threadblocks execute for a relatively long period of time, the stream priority system breaks down. The low priority threadblocks can occupy an SM and prevent launch of higher priority threadblocks.

One possible “solution” (not available today in CUDA AFAIK) is device fissioning, and reserving some portion of the device for traffic scheduled by the programmer, rather than by the runtime. Another possible “solution” would be to allow for pre-emption at the instruction level: threadblocks from a higher priority kernel would simply pre-empt threadblocks from a lower priority kernel. That is also not available today, AFAIK (although threadblock preemption may occur in some scenarios, e.g. CDP).

I view device fissioning as a relatively “crude” solution compared to the other one I mention, but it’s a complex topic and this doesn’t cover all considerations by any means.

By the way, if you arrange the work associated with a threadblock to be relatively “short”, then even a “long running low priority” kernel can still be effectively “preempted” by (threadblocks of) a high-priority kernel. This suggestion potentially runs counter to optimizing for maximum performance of the low-priority kernel; however, device fissioning certainly does not apply resources optimally either, according to my understanding.
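A sketch of that arrangement (chunk size and kernel name are illustrative only): the low-priority kernel is launched as many short-lived blocks, each handling a small bounded chunk, so high-priority blocks can slot in as SMs drain.

```cuda
#define CHUNK 4096   // elements per block; small enough that each block retires quickly

__global__ void lowPriChunk(const float *in, float *out, int n)
{
    int base = blockIdx.x * CHUNK;
    for (int i = threadIdx.x; i < CHUNK && base + i < n; i += blockDim.x)
        out[base + i] = in[base + i] * 0.5f;   // short, bounded amount of work per block
}

// launch with enough blocks to cover n, in the low-priority stream:
//   int blocks = (n + CHUNK - 1) / CHUNK;
//   lowPriChunk<<<blocks, 256, 0, lowPriStream>>>(in, out, n);
```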