Fixing SMs for a kernel

Is it possible to fix number of SMs to be given to a particular kernel?

Yes: CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops

Generally speaking, it’s not possible. You don’t have control over the scheduling of threadblocks to SMs, nor is there any method to restrict SMs to be used by a particular kernel.

With some “extraordinary” programming techniques, it is possible to cause a kernel to only “occupy” certain SMs, but this is well outside the scope of typical CUDA programming.

I assume what allanmac is referring to would be scoping your kernel so as to launch only a certain number of blocks. This would have a side effect that you would only “occupy” that many SMs. However:

  1. You wouldn’t get to pick or control the SMs
  2. You would be launching at most 1 block per SM, which is generally not a high-performance programming technique (see the sketch after this list).
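Something like the following is what that scoped launch might look like. This is only an illustrative sketch (kernel name, sizes and block count are made up); it shows that you can cap the number of blocks, but not which SMs they land on.

```cuda
__global__ void work(float *x, int n)
{
    // grid-stride loop so a deliberately small grid still covers all n elements
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] += 1.0f;
}

int main()
{
    int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));

    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);

    // launch fewer blocks than there are SMs, hoping to "occupy" only that many;
    // the hardware scheduler still decides which SMs actually get them
    int numBlocks = numSMs > 4 ? 4 : numSMs;
    work<<<numBlocks, 256>>>(x, n);
    cudaDeviceSynchronize();

    cudaFree(x);
    return 0;
}
```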


What @txbob said!

I like using uber-blocks (~1024 threads and all the registers and smem) but only when I have a workload that benefits from coordinating so many warps. An uber-block (vs. a single-warp micro-block) typically makes sense if you’re really really trying to squelch GMEM traffic.
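For what it's worth, here is roughly what I mean, as a made-up sketch (names and tile size are illustrative, not from any real code): one 1024-thread block stages a tile in shared memory once, and all 32 warps reuse it, trading occupancy for reduced GMEM traffic.

```cuda
#define TILE 4096   // floats staged per block (16 KB of shared memory), illustrative

__global__ void uberBlockKernel(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];

    // each block processes one TILE-sized chunk
    int base = blockIdx.x * TILE;

    // cooperative load: all 1024 threads fill the tile
    for (int i = threadIdx.x; i < TILE && base + i < n; i += blockDim.x)
        tile[i] = in[base + i];
    __syncthreads();

    // all warps now work out of shared memory instead of re-reading GMEM
    for (int i = threadIdx.x; i < TILE && base + i < n; i += blockDim.x)
        out[base + i] = tile[i] * tile[i];
}

// launch: uberBlockKernel<<<(n + TILE - 1) / TILE, 1024>>>(in, out, n);
```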

Wait, are you saying that grid-stride loops are inefficient? Or are you referring to the specific configuration of grid stride looping implied by the OP?

Pretty sure @txbob is saying that a grid might be both resident on the GPU and stationary on a known number of multiprocessors if and only if the GPU was idle before launch and the grid blocks are larger than a few warps (most likely multiprocessor-spanning).

Some experimentation could verify how CUDA GPUs stripe blocks across available multiprocessors but this could never be relied upon.
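One way to run that experiment (for curiosity only, never something to rely on) is to read the %smid special register from each block and print which SM it ran on:

```cuda
#include <cstdio>

__device__ unsigned int smid()
{
    unsigned int id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));   // special register holding the SM id
    return id;
}

__global__ void whereAmI()
{
    if (threadIdx.x == 0)
        printf("block %u ran on SM %u\n", blockIdx.x, smid());
}

int main()
{
    whereAmI<<<16, 128>>>();   // launch a few blocks and observe the striping
    cudaDeviceSynchronize();
    return 0;
}
```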

The O.P.'s question is sort of asking if “device fission” is supported. It’s not (yet!). That’s a feature that’s typically only supported by OpenCL runtimes.

Yes, I am pretty much interpreting this as a device fission question. I may be misinterpreting it.

A strategy (however it comes about) that only launches one block per SM may not be taking full advantage of the machine, i.e. exposing enough parallelism.

The two most important priorities for a GPU programmer are to effectively expose “enough” parallelism to saturate the machine, and to make effective use of the memory subsystem(s).

One measure of exposed parallelism is the number of active threads or warps that are resident on an SM. For many GPUs this has a maximum limit of 2048 threads (or 64 warps) and you cannot achieve this with a single threadblock.
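You can query those limits for whatever GPU you have (the 2048-thread / 64-warp figure is common but not universal), e.g. with something like:

```cuda
#include <cstdio>

int main()
{
    int threadsPerSM = 0, threadsPerBlock = 0;
    cudaDeviceGetAttribute(&threadsPerSM,
                           cudaDevAttrMaxThreadsPerMultiProcessor, 0);
    cudaDeviceGetAttribute(&threadsPerBlock,
                           cudaDevAttrMaxThreadsPerBlock, 0);
    printf("max threads per SM:    %d (%d warps)\n", threadsPerSM, threadsPerSM / 32);
    printf("max threads per block: %d\n", threadsPerBlock);
    // on most GPUs threadsPerBlock < threadsPerSM, so a single block
    // cannot fully populate an SM by itself
    return 0;
}
```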

No I am not saying grid stride loops are inefficient. The previous concept I was discussing (optimization/exposed parallelism) really has nothing to do with grid-stride loops. I can write a grid stride loop that uses 100,000 threads (probably exposing enough parallelism), and I can write a grid stride loop of exactly 1 thread (definitely not enough parallelism). These concepts are approximately orthogonal.
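To make that concrete, here is a plain grid-stride loop (illustrative kernel only); the kernel is identical in both cases, and only the launch configuration decides how much parallelism is actually exposed:

```cuda
__global__ void scale(float *x, float a, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= a;
}

// plenty of exposed parallelism:
//   scale<<<400, 256>>>(x, 2.0f, n);   // ~100,000 threads
// technically correct, but starves the GPU:
//   scale<<<1, 1>>>(x, 2.0f, n);       // a single thread walks all n elements
```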

Good points there! I’d like to add one significant thing: in many applications it is (and will increasingly be) a serious issue that not all tasks have enough parallelism for a GPU with 20-30 SMs. However, even if the CPU were perfectly capable of doing the task, in order to optimize data locality these small kernels need to be executed, inefficiently, on the GPU. This causes a number of issues, ranging from delaying the critical path, to preempting other kernels, to simply executing at a very low parallel efficiency compared to running on, say, just a few SMs.

Therefore, I would very much like to see at least some simple ways to do “device fission” (e.g. assign streams to a set of SMs for exclusive or conditionally exclusive scheduling). While engineers from NVIDIA have previously acknowledged these issues, I have not received much feedback on whether partitioning SMs is something that’s even being considered.

I’m sure there are improvements that can address various use cases. I think using today’s technology the idea would be to try to address the use case with (a rough sketch follows the list):

  1. concurrent kernels
  2. streams
  3. stream priorities
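As a rough sketch of that combination (kernel names, sizes and workloads are made up for illustration): put the bulk work in a low-priority stream and the small latency-critical work in a high-priority stream, so the high-priority blocks are scheduled ahead of the long-running kernel’s blocks as SMs free up.

```cuda
#include <cuda_runtime.h>

__global__ void longKernel(float *a, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        for (int k = 0; k < 100; ++k)          // artificially long-running block
            a[i] = sqrtf(a[i] + 1.0f);
}

__global__ void shortKernel(float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] += 1.0f;                   // small latency-critical work
}

int main()
{
    int leastPri, greatestPri;   // note: numerically lower value = higher priority
    cudaDeviceGetStreamPriorityRange(&leastPri, &greatestPri);

    cudaStream_t lowPri, highPri;
    cudaStreamCreateWithPriority(&lowPri,  cudaStreamNonBlocking, leastPri);
    cudaStreamCreateWithPriority(&highPri, cudaStreamNonBlocking, greatestPri);

    int n = 1 << 20, m = 1 << 14;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, m * sizeof(float));

    longKernel <<<1024, 256, 0, lowPri >>>(a, n);              // bulk work, low priority
    shortKernel<<<(m + 127) / 128, 128, 0, highPri>>>(b, m);   // critical path, high priority

    cudaDeviceSynchronize();
    cudaStreamDestroy(lowPri);
    cudaStreamDestroy(highPri);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```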

Well, I’m glad to get some clarity! You scared me for a second, ha ha. But I wouldn’t say they’re 100% orthogonal concepts. If anything, that dot product is at least half the product of the magnitudes of the respective vectors :P

@txbob: I see no way to address the issue I raised without being able to control the scheduling and “width” of concurrent kernels. Additionally, it turns out that priorities can be a nightmare when optimizing for the critical path (e.g. a sequence of short kernels in a high-priority stream keeps “losing” the GPU to a long-running kernel in the low-priority stream). Am I missing something?


I don’t think you’re missing anything. I mentioned already that “I’m sure there are improvements that can address various use cases.” Today’s technology does not address all possible scenarios.

Without other descriptions or considerations, I think the general approach to handling small kernels would be to suggest the use of concurrency.

For the case of long running kernels intermixed with higher priority kernels, stream priorities (today) only impact scheduling priority at the threadblock launch level. For a “low priority” kernel whose threadblocks execute for a relatively long period of time, the stream priority system breaks down. The low priority threadblocks can occupy an SM and prevent launch of higher priority threadblocks.

One possible “solution” (not available today in CUDA AFAIK) is device fissioning, and reserving some portion of the device for traffic scheduled by the programmer, rather than by the runtime. Another possible “solution” would be to allow for pre-emption at the instruction level: threadblocks from a higher priority kernel would simply pre-empt threadblocks from a lower priority kernel. That is also not available today, AFAIK (although threadblock preemption may occur in some scenarios, e.g. CDP).

I view device fissioning as a relatively “crude” solution compared to the other one I mention, but it’s a complex topic and this doesn’t cover all considerations by any means.

By the way, if you arrange the work associated with a threadblock to be relatively “short”, then even a “long running low priority” kernel can still be effectively “preempted” by (threadblocks of) a high-priority kernel. This suggestion potentially runs counter to optimizing for maximum performance of the low-priority kernel; however, device fissioning certainly does not apply resources optimally either, according to my understanding.
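A sketch of that arrangement (chunk size and kernel name are illustrative only): the low-priority kernel is launched as many short-lived blocks, each handling a small bounded chunk, so high-priority blocks can slot in as SMs drain.

```cuda
#define CHUNK 4096   // elements per block; small enough that each block retires quickly

__global__ void lowPriChunk(const float *in, float *out, int n)
{
    int base = blockIdx.x * CHUNK;
    for (int i = threadIdx.x; i < CHUNK && base + i < n; i += blockDim.x)
        out[base + i] = in[base + i] * 0.5f;   // short, bounded amount of work per block
}

// launch with enough blocks to cover n, in the low-priority stream:
//   int blocks = (n + CHUNK - 1) / CHUNK;
//   lowPriChunk<<<blocks, 256, 0, lowPriStream>>>(in, out, n);
```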