What is the sequence of SM launching?


I have seen this. I know that blocks are launched (at least in the 1D case) in order of blockIdx.x. But what about SMs? What is the rule that binds a block index to an SM index? (Can we somehow control it?)

It’s not documented. The linked article attempts to draw a correlation experimentally. I’m not sure what the purpose would be.

CUDA provides no order of block execution and provides no order of SM execution. Any code that depends on such an order for correctness is broken by definition.

Therefore attempts to discover such orders have questionable value, in my view. A proper CUDA program should not depend on any such presumed order.

There are no rules for these things, and no explicit controls are given. It is possible to create your own order using atomics. There are various forum posts about this, a bit of searching will locate one for you.

Thank you!!!

Emmm, I am trying to bind specific blocks to specific SMs… For blocks I can use atomics, but for SMs… what can I do?

Like, I want the first block (which runs on SM index T) and then the next block to also run on SM T, to somehow reuse L1 or L2. How can I do that?

As Robert_Crovella said, you cannot control which block is assigned to which SM.
With PTX you can find out the SM id of a running block: see the %smid special register in the PTX ISA 8.3 documentation.
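For example, a minimal device helper that reads the %smid special register via inline PTX (note that the PTX ISA documents %smid as a runtime value that may not be stable, so it is informative only):

```cuda
__device__ unsigned int get_smid()
{
    unsigned int smid;
    // %smid is a PTX special register; read it with inline assembly.
    // The value reflects where the warp happens to be running right now.
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}
```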


Emmm, well… but CUTLASS somehow uses blockIdx: cutlass/include/cutlass/gemm/threadblock/threadblock_swizzle.h at main · NVIDIA/cutlass · GitHub

I think here we have something to learn…

I do not see how the linked CUTLASS code contains anything relevant to the question at hand. Nowhere does it assign thread blocks to specific SMs (which is not possible, as previously noted).

Yeah, I mean, they use the “blocks are launched in order of their blockIdx.x” property to arrange the GEMM order.

Or maybe you could look at this?


How do they control the block-to-SM relation?

I guess there is no block-to-SM connection; they just use the “blocks are launched in order of their blockIdx.x” property, so these blocks will be nearby.

When I looked at the code yesterday, I did not see a thread-block-to-SM mapping in the code; I saw a matrix-tile-to-thread-block mapping, which would be a normal part of any tiled GEMM code.

Thank you for your fast reply!!! Yeah, but if we want to utilize L2 efficiently, these matrix tiles should be launched in a time sequence, so CUTLASS really is relying on the “blocks are launched in order of their blockIdx.x” property…

Anyway, I think we have reached an agreement here. Haha.

Actually, I know %smid can print out the SM number; I have been trying to reverse engineer the SM-block relation recently. Like this:

Started block  2 on SM  4 at 0.
Started block  3 on SM  6 at 0.
Started block 10 on SM  5 at 0.
Started block 11 on SM  7 at 0.
Started block 18 on SM  4 at 0.
Started block 19 on SM  6 at 0.
Started block 26 on SM  5 at 0.
Started block 27 on SM  7 at 0.
Started block 14 on SM 13 at 0.
Started block 15 on SM 15 at 0.
Started block  6 on SM 12 at 0.
Started block  7 on SM 14 at 0.
Started block  0 on SM  0 at 0.
Started block  1 on SM  2 at 0.
Started block  4 on SM  8 at 0.
Started block  5 on SM 10 at 0.
Started block 22 on SM 12 at 0.
Started block 23 on SM 14 at 0.
Started block  8 on SM  1 at 0.
Started block  9 on SM  3 at 0.
Started block 12 on SM  9 at 0.
Started block 13 on SM 11 at 0.
Started block 16 on SM  0 at 0.
Started block 17 on SM  2 at 0.
Started block 20 on SM  8 at 0.
Started block 21 on SM 10 at 0.
Started block 24 on SM  1 at 0.
Started block 25 on SM  3 at 0.
Started block 28 on SM  9 at 0.
Started block 29 on SM 11 at 0.
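For reference, a minimal kernel sketch (my own reconstruction, not the original poster's code) that produces a log of this shape; using %globaltimer_lo for the timestamp is an assumption about how the “at …” column was generated:

```cuda
#include <cstdio>

__global__ void log_block_placement()
{
    if (threadIdx.x == 0) {           // one line per block
        unsigned int smid, t;
        asm volatile("mov.u32 %0, %%smid;"          : "=r"(smid));
        // lower 32 bits of the nanosecond-resolution global timer
        asm volatile("mov.u32 %0, %%globaltimer_lo;" : "=r"(t));
        printf("Started block %2u on SM %2u at %u.\n", blockIdx.x, smid, t);
    }
}
```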

Also take care: the printf output can appear in a different order than the blocks actually started.

I think there is value in getting a general idea of how blocks are distributed over SMs, even if one cannot rely on it.

If you can find out which SM a block is running on, you can dynamically assign work packages, e.g. use %smid instead of blockIdx.x for choosing the work, or a combination of both, e.g. %smid * 2 | (blockIdx.x % 2) for two blocks per SM.


Great idea! Instead of assigning work using blockIdx.x, we can use %smid now!
Emmm, but we do not know which block will be running on which SM, right…? Any further ideas?

It is not necessary to know which block number (blockIdx.x) is serving the current work. You would use %smid everywhere you used blockIdx.x before. OK, this is quite simplified; I give a more complete outline below.

A flexible production program would run the following sequence, starting with some heuristics and setup:

// First, programmatically query the architecture data / compute capability of the installed GPU to know how many SMs to expect and how the L2 is organized. Some GPUs have two separate L2 caches, where near hits are faster than far hits and far data is potentially duplicated, taking away L2 cache space.

// Next, run a test kernel that finds out whether all SMs are active. On a mobile GPU, a power-saving mode may be active in which SMs sleep. On a server GPU, only a partition may be available. For better yield, defective SMs can be deactivated on some GPU variants (or even good SMs, for market segmentation), e.g. the x070, x080 and x090 could actually be the same chip.

// In the following, 8 blocks are run per SM, i.e. 2 blocks per SM partition. We use a GPU with 30 SMs => 240 blocks.

// Create a structure in global memory with 240 flags, initialized to 0. They indicate whether the numbered work block has been started.

// Kernel:

unsigned int sm, warp;
asm("mov.u32 %0, %%smid;"   : "=r"(sm));   // e.g. 0..29
asm("mov.u32 %0, %%warpid;" : "=r"(warp));
int smp = (warp & 0xC) >> 2; // example: typically two specific bits indicate which of the 4 SM partitions the warp is running on // 0..3
int blockbit = blockIdx.x & 1; // choose a bit which is typically different from smp and from sm; alternatively keep just 120 flags and choose this bit with the CAS instruction // 0..1

int smblock  = sm * 8 + smp * 2 + blockbit;
int idxblock = blockIdx.x;

if (smblock < 240) // in case the SM number is 30 or larger; this should not happen
    if (!atomicCAS(&flags[smblock], 0, 1)) // was the work already claimed by a different block?
        work on smblock;

if (!atomicCAS(&flags[idxblock], 0, 1)) // only as a safeguard
    work on idxblock;

First the kernel works on the blocks given by the %smid number.

In a second step, each block tests whether all work blocks were completed; otherwise, it works on the block given by the blockIdx.x number to ensure, as a safeguard, that every work block is completed once.

Reasons this can happen: other processes or streams may launch kernels, or 3D/desktop graphics may be output at the same time, so an SM can be skipped and another can get 2 blocks.

With this code, each work block is guaranteed to be processed exactly once. And in a normal program run, the blocks are ordered by SM number.

With most GPUs there won’t be any noticeable difference from sorting the blocks according to SM number.
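For illustration, the outline above could be put together roughly like this. It is a sketch with assumptions: 240 work tiles, flags[] zero-initialized in global memory, a hypothetical process_tile(), and tile selection done by thread 0 and broadcast via shared memory so that each block claims a tile exactly once (a per-thread atomicCAS would let only one thread win):

```cuda
// Hypothetical per-tile work; stands in for the real GEMM tile code.
__device__ void process_tile(int tile) { /* real work for one tile */ }

__global__ void sm_ordered_kernel(int *flags, int num_tiles)
{
    __shared__ int sm_tile, got_sm_tile, got_idx_tile;

    if (threadIdx.x == 0) {
        unsigned int sm, warp;
        asm("mov.u32 %0, %%smid;"   : "=r"(sm));    // e.g. 0..29
        asm("mov.u32 %0, %%warpid;" : "=r"(warp));
        int smp      = (warp & 0xC) >> 2;  // partition of warp 0; bit layout is a guess
        int blockbit = blockIdx.x & 1;     // distinguishes the 2 blocks per partition
        sm_tile      = sm * 8 + smp * 2 + blockbit;

        got_sm_tile  = sm_tile < num_tiles &&
                       atomicCAS(&flags[sm_tile], 0, 1) == 0;
        got_idx_tile = atomicCAS(&flags[blockIdx.x], 0, 1) == 0;
    }
    __syncthreads();

    if (got_sm_tile)        // tile chosen by placement (good L2 locality)
        process_tile(sm_tile);
    if (got_idx_tile)       // safeguard: every tile gets processed exactly once
        process_tile(blockIdx.x);
}
```

Claiming the blockIdx.x tile eagerly is a simplification of the two-step safeguard described above; the exactly-once guarantee still holds through the atomicCAS flags.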


Alternatively, with Hopper you have the notion of thread block clusters. With them, the blocks know which other blocks are in the neighborhood architecture-/die-wise, which should also help with L2. (The Hopper GH100 has a split L2 of 2×24 MB, so it matters on which half a block executes.)
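As a rough sketch of that mechanism (assumes sm_90 and a recent CUDA toolkit; the 2×1×1 cluster shape and the out parameter are arbitrary examples):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Blocks of one cluster are co-scheduled close together on the die,
// so neighboring ranks share locality (on GH100, presumably the same L2 half).
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(int *out)
{
    cg::cluster_group cluster = cg::this_cluster();
    if (threadIdx.x == 0)
        out[blockIdx.x] = cluster.block_rank();  // this block's rank within its cluster
}
```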


Very good answer!!! It really helps.

Just one question: why did you mention warpid and the four schedulers? Like… would it be better for nearby warps to work together? (Never heard about that! Sounds like a new idea.)

Actually I read HPCA24 paper recently: https://www.nealcrago.com/wp-content/uploads/WASP_HPCA2024_preprint.pdf

I see “pipe_stageId”, just like smid, but my code cannot print it out. Maybe I need to wait for a new CUDA toolkit. (Just sharing this interesting news with you ^.^)

The paper is a case study on a simulated, modified GPU. It is no surprise that “pipe_stageId” does not exist in current code/hardware.

For the past few architectures, each SM comprises 4 SM partitions. The SM partitions share some resources like shared memory, but have independent registers, execution units for arithmetic operations, and schedulers. Each running warp is assigned to a specific SM partition and stays there (there is no official guarantee that a warp is not reassigned, but in practice this does not happen). So in practice, one typically wants to fill all SM partitions to make full use of the compute power and not leave partitions idle. To hide latency, one should run more than one warp per SM partition. The SM partition 0…3 within an SM can be deduced from two bits of the %warpid.
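Under that (undocumented) assumption, the partition of the calling warp could be probed roughly like this; which two bits of %warpid encode the partition is a guess that may differ per architecture:

```cuda
__device__ int sm_partition_guess()
{
    unsigned int warpid;
    asm volatile("mov.u32 %0, %%warpid;" : "=r"(warpid));
    // Assumption: two bits of the (volatile, scheduling-dependent) %warpid
    // correlate with the SM partition; verify experimentally per architecture.
    return (warpid & 0xC) >> 2;   // 0..3
}
```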


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.