Question about CTA/warp lifecycle

lifu.hlf · December 7, 2025, 7:08am

Is my understanding correct that for both CTAs and warps, once they become active/resident on the corresponding SM (for a CTA) or SMSP (for a warp), they never become “inactive” until the CTA/warp completes?

In particular,

CTA-to-SM

When a thread block (CTA) is launched, it is permanently assigned to one specific SM.
An assigned CTA keeps occupying one slot of the max-CTAs-per-SM quota (e.g., 32 for H100) until ALL warps of the CTA have completed execution; only then is the slot freed for the next “unassigned” CTA.

Warp-to-SMSP

Within an SM, each warp is distributed to one of the 4 SMSPs. Each warp stays assigned to that SMSP for the lifetime of the warp. An assigned warp can never be “load-balanced” or “stolen” to another SMSP, even within the same SM.
An active (aka resident) warp keeps occupying one slot of the max-resident-warps-per-SMSP quota (e.g., 64/4 = 16 for H100) until the warp has completed; only then is the slot freed for the next inactive warp.
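
The quotas mentioned above can be read back from the runtime rather than looked up per architecture. The following is a minimal sketch, not a definitive tool: it assumes device 0, a CUDA 11 or newer toolkit (for `maxBlocksPerMultiProcessor`), and 4 SMSPs per SM as stated in the post, since the runtime API does not report the SMSP count. `dummy_kernel` and `kAssumedSmspPerSm` are hypothetical names used only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// ASSUMPTION: 4 SMSPs per SM, taken from the post; not reported by the runtime API.
constexpr int kAssumedSmspPerSm = 4;

// Hypothetical kernel, present only so the occupancy query has something to inspect.
__global__ void dummy_kernel(float* out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no usable CUDA device\n");
        return 1;
    }

    // Resident-warp quota per SM, derived from the resident-thread limit.
    int max_warps_per_sm = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    std::printf("max resident CTAs per SM : %d\n", prop.maxBlocksPerMultiProcessor);
    std::printf("max resident warps per SM: %d\n", max_warps_per_sm);
    std::printf("warps per SMSP (assuming %d SMSPs): %d\n",
                kAssumedSmspPerSm, max_warps_per_sm / kAssumedSmspPerSm);

    // How many CTAs of this particular kernel fit on one SM at once,
    // given its register and shared-memory usage at the chosen block size.
    const int block_size = 256;
    int resident_ctas = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &resident_ctas, dummy_kernel, block_size, 0) != cudaSuccess) {
        std::printf("occupancy query failed\n");
        return 1;
    }
    std::printf("resident CTAs of dummy_kernel at %d threads/CTA: %d\n",
                block_size, resident_ctas);
    return 0;
}
```

The occupancy figure is usually lower than the raw CTA quota, because the warp, register, and shared-memory limits apply at the same time as the CTA-slot limit.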

Robert_Crovella · December 7, 2025, 2:44pm

I think your assertions are generally a good mental model. A CTA should be thought of as permanently assigned to an SM (until it retires) for most considerations. Pre-emption does provide a mechanism by which a CTA could “move” from one SM to another. For this reason, the programming guide states that the smid special register value is not guaranteed to be the same for the lifetime of a threadblock.
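
One way to look at this yourself is to read the `%smid` special register at the start and end of each CTA and compare the two values. The sketch below is only a probe under stated assumptions: `read_smid`, `record_smid`, the grid size, and the spin length are all hypothetical choices made for illustration. Seeing zero changes does not show that migration cannot happen; it only shows that it did not happen in that run.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Read the %smid special register through inline PTX.
__device__ unsigned int read_smid() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Thread 0 of each CTA records the SM id, spins for a while, then records it again.
__global__ void record_smid(unsigned int* at_start, unsigned int* at_end,
                            long long spin_cycles) {
    if (threadIdx.x == 0) {
        at_start[blockIdx.x] = read_smid();
        long long t0 = clock64();
        while (clock64() - t0 < spin_cycles) { /* busy wait */ }
        at_end[blockIdx.x] = read_smid();
    }
}

int main() {
    const int blocks = 1024;  // arbitrary grid size for the probe
    const size_t bytes = blocks * sizeof(unsigned int);
    unsigned int *d_start = nullptr, *d_end = nullptr;
    if (cudaMalloc(&d_start, bytes) != cudaSuccess ||
        cudaMalloc(&d_end, bytes) != cudaSuccess) {
        std::printf("allocation failed\n");
        return 1;
    }

    record_smid<<<blocks, 32>>>(d_start, d_end, 1000000LL);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        std::printf("kernel failed\n");
        return 1;
    }

    std::vector<unsigned int> h_start(blocks), h_end(blocks);
    cudaMemcpy(h_start.data(), d_start, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_end.data(), d_end, bytes, cudaMemcpyDeviceToHost);

    // Count CTAs whose SM id differed between the two reads.
    int moved = 0;
    for (int i = 0; i < blocks; ++i) {
        if (h_start[i] != h_end[i]) ++moved;
    }
    std::printf("CTAs whose smid changed during execution: %d of %d\n", moved, blocks);

    cudaFree(d_start);
    cudaFree(d_end);
    return 0;
}
```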

Yes, a CTA keeps using its slot until it fully retires.
Yes, a warp keeps using its slot until it fully retires.

I have experimentally convinced myself in the past that when a warp retires, even if its owning threadblock has not yet retired, in some cases the resources used by that warp (e.g. registers) can become available for a new CTA to be deposited on that SM.

Curefab · December 7, 2025, 5:17pm

‘used to state’ → the warp would be restored to the same SM now?

When does preemption happen? During debug, operating system task switching, operating system hibernation?

lifu.hlf · December 7, 2025, 11:38pm

Thank you @Robert_Crovella for the very helpful info!

I am also curious about the preemption and when that would happen, thanks!

Robert_Crovella · December 8, 2025, 4:27am

I didn’t say that. “used to state” means that in the past, the programming guide said a particular thing (and I linked to it, to show precisely what I am referring to), and now, it does not seem to say that thing ~~(at least, I could not find it.)~~ (see EDIT below).

I don’t have any further information. It was never well-specified to begin with. Furthermore, the programming guide has gone through a substantial rewrite recently - you don’t need to take my word for it, in my view it is self-evident.

As far as I know, it’s not specified anywhere. I would guess that debugging may use preemption. I would also guess that (“modern” time-sliced) context-switching involves preemption. In the past, I was fairly convinced that certain CDP 1.0 guarantees would require pre-emption in some cases, but that was just guesswork. And with CDP 2.0, I’m not sure if any mechanisms might use preemption. I don’t have any authoritative info about when preemption may be used. AFAIK it is not specified in any sort of exhaustive fashion anywhere.

EDIT:
I did locate it in the “new” programming guide, here. So no real change as far as inclusion goes. Note this text there:

The device runtime may reschedule thread blocks onto different SMs in order to more efficiently manage resources.

system · December 22, 2025, 4:28am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
