Resident warp vs active warp


I’m currently trying to understand the life cycle of threads, warps and blocks.

A warp can be active or inactive. But what is a resident warp?

Can an active block have inactive warps or threads?

I’m a little bit confused about these two words (active and resident).

Can someone help me?

Many Thanks

PS: English is not my mother tongue…

The terms usually used by the profiler are:

active_warp - A warp is active if it has been allocated to an SM and all warp level resources (registers) have been allocated.

eligible_warp - An active warp is eligible if it can issue an instruction.

stalled_warp - An active warp is stalled if it is not able to issue an instruction due to a resource or data dependency.
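The three warp states above can be modeled as a small classifier. This is only a sketch of the definitions in this post: the `Warp` class, its field names, and the "not active" label are hypothetical and are not profiler or CUDA API names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Warp:
    # Hypothetical model: True once the warp is assigned to an SM and its
    # warp-level resources (e.g. registers) have been allocated.
    resources_allocated: bool
    # Hypothetical label for whatever the warp is waiting on, or None.
    waiting_on: Optional[str] = None


def warp_state(warp: Warp) -> str:
    """Classify a warp using the definitions given in the post above."""
    if not warp.resources_allocated:
        return "not active"
    if warp.waiting_on is None:
        return "eligible"   # active and able to issue an instruction
    return "stalled"        # active but blocked on a resource or data dependency


print(warp_state(Warp(True)))
print(warp_state(Warp(True, waiting_on="memory load")))
```

Note that in this model both eligible and stalled warps are active; "stalled" is a sub-state of "active", not its opposite.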

I’m not sure which literature uses the term “resident”. A resident warp would be the same as an active warp.

Threads can be in different states.

Active Thread - A thread is active if its active bit is set in the warp active_mask.

Inactive Thread - A thread is inactive if its active bit is not set in the active_mask. This can happen if the warp takes a divergent control path.

Exited Thread - A thread is inactive and exited if the thread has executed an EXIT instruction. Exited threads cannot become active again.
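To make the active mask concrete, here is a minimal sketch that treats the mask as a 32-bit integer, one bit per thread (lane) in the warp. The helper names `split_on_branch` and `is_active` are made up for illustration; real hardware manages the mask itself.

```python
WARP_SIZE = 32
FULL_MASK = (1 << WARP_SIZE) - 1  # all 32 threads active


def split_on_branch(active_mask, predicate):
    """Split the currently active lanes into (taken, not_taken) masks."""
    taken = 0
    for lane in range(WARP_SIZE):
        if (active_mask >> lane) & 1 and predicate(lane):
            taken |= 1 << lane
    return taken, active_mask & ~taken


def is_active(mask, lane):
    """A thread is active if its bit is set in the mask."""
    return bool((mask >> lane) & 1)


# Divergent branch: only lanes 0..7 take the "if" side.
taken, not_taken = split_on_branch(FULL_MASK, lambda lane: lane < 8)

# While the warp executes the taken path, lanes 8..31 are inactive.
print(hex(taken), hex(not_taken))
print(is_active(taken, 3), is_active(taken, 20))
```

An exited thread could be modeled the same way by clearing its bit in every mask for the rest of the warp's lifetime.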

In short:

The number of resident threads is the number of threads the GPU can hold in its on-chip memory at one time.

Longer answer:

The GPU has a limited number of cores available. It cannot execute millions of threads at the same time. The GPU can only execute as many threads at the same time as it has cores available.

However, some of these threads may stall for different reasons, so the GPU uses a little trick. It has some additional on-chip memory which is used to hold additional threads. These threads are not executing yet, but they are already initialized, I suppose, so that they can be executed at a moment’s notice.

This is what is referred to as resident threads… think of these as “on chip threads”. Like a CPU may store thread contexts on the stack in some cache somewhere, I suppose.

So the GPU does not have to load threads from main memory or something… but it can quickly switch to these resident threads and execute those… a sort of hardware thread context switching.

It can then later return to the stalled threads and execute them once they are unblocked.
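The effect of switching between resident warps can be illustrated with a toy simulation. It assumes a single scheduler that issues at most one instruction per cycle, and that every instruction is followed by a fixed stall of `latency` cycles; both are simplifications made up for this sketch, not a description of any particular GPU.

```python
def issue_utilization(resident_warps, latency, cycles=1000):
    """Fraction of cycles in which the toy scheduler issued an instruction."""
    ready_at = [0] * resident_warps  # cycle at which each warp may issue again
    issued = 0
    for cycle in range(cycles):
        # Pick the first resident warp that is not stalled, if any.
        for w in range(resident_warps):
            if ready_at[w] <= cycle:
                ready_at[w] = cycle + 1 + latency  # warp stalls after issuing
                issued += 1
                break
    return issued / cycles


for n in (1, 5, 10):
    print(n, issue_utilization(n, latency=9))
```

With a stall of 9 cycles, one resident warp keeps the scheduler busy 10% of the time, five warps 50%, and ten warps 100%. In this model, having more resident warps than that adds nothing, because the stall is already fully hidden.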

If all resident threads are blocked waiting on something that can never happen (for example, a result from a thread that is not resident yet), the GPU will ultimately deadlock.

So think of the GPU as a batch-based processor. The kernel’s threads must all be able to exit if the GPU is not to deadlock. No GPU thread should wait on the result of another thread that may not be resident yet, or the kernel may deadlock.

For example, threads 1 to 10000 must not wait on the result of thread 1000000.

Because this would fill the GPU with threads 1 to 10000, or whatever its maximum number of resident threads is… and then it will never execute thread 1000000.
Thus threads 1 to 10000 will be waiting forever ;)
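The scenario can be reproduced with a small model. It assumes threads become resident strictly in launch order and hold their slot until they finish; real GPUs schedule whole blocks and make no ordering promise, so treat `run_kernel` and its parameters as a hypothetical illustration only.

```python
def run_kernel(num_threads, max_resident, waits_on):
    """Return True if every thread finishes, False on deadlock.

    waits_on maps a thread id to the id of the thread whose result it needs.
    """
    finished = set()
    resident = []
    next_thread = 0
    while len(finished) < num_threads:
        # Fill free resident slots in launch order.
        while len(resident) < max_resident and next_thread < num_threads:
            resident.append(next_thread)
            next_thread += 1
        runnable = [t for t in resident
                    if waits_on.get(t) is None or waits_on[t] in finished]
        if not runnable:
            return False  # every resident thread waits on a non-resident one
        for t in runnable:
            finished.add(t)
            resident.remove(t)  # frees a slot for the next thread
    return True


# Threads 0..3 fill all four slots and wait on thread 19, which never loads.
print(run_kernel(20, 4, {0: 19, 1: 19, 2: 19, 3: 19}))
# Waiting on a thread that is resident at the same time is fine in this model.
print(run_kernel(20, 4, {0: 3}))
```

The first call reports a deadlock and the second completes, which matches the rule of thumb above: waiting is only dangerous when the awaited thread cannot become resident.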

Thank you guys!

It’s clearer now.

For the resident warp and block: I found these terms in the official CUDA documentation.
For me, a resident thread is a thread that has been allocated. These threads can be active or inactive. A resident warp is the same as an active warp, and a resident block is the same as an active block.

I wonder how many blocks/warps are needed to hide the latency of memory accesses. Is it a function of the maximum number of resident blocks/warps?

If you are using the Nsight VSE profiler you can determine if you have sufficient warps to hide latency by looking at the Issue Efficiency experiment. See

If the kernel is fully hiding latency then every warp scheduler should be able to pick an eligible warp each cycle and issue 1 or 2 instructions. The Warp Issue Efficiency chart shows the percentage of active cycles in which the warp scheduler was not able to issue an instruction. If “No Eligible” is high there are two options.

  1. Increase occupancy. The Warps Per SM chart shows the theoretical occupancy and achieved occupancy in warps (vs. percentage). If the theoretical value is low (32) then you can go to the Achieved Occupancy experiment and determine the tradeoffs of increasing occupancy. If the theoretical occupancy is high but the achieved occupancy is low then either the launch dimensions (GridDim) do not fill the machine or there is a tail effect in blocks (a portion of the warps exit early in each block). If both the theoretical and the achieved occupancy are high then you have to resolve stalls. See the Issue Stall Reasons chart.

  2. Resolve stalls. The Issue Stall Reasons chart shows the percentage of time active warps were stalled. Removing the primary reason will improve the number of eligible warps.
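For a rough feel of where a theoretical occupancy number comes from, the sketch below takes the minimum over the usual per-SM limits (warp slots, block slots, registers, shared memory). The default limits are placeholder values assumed for illustration, not the figures of any specific GPU, and the sketch ignores allocation granularity, so the profiler or the CUDA occupancy calculator remains the authoritative source.

```python
def theoretical_warps_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                             max_warps=64, max_blocks=32,
                             regs_per_sm=65536, smem_per_sm=49152,
                             warp_size=32):
    """Estimate resident warps per SM; all default limits are assumptions."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    limit_by_warps = max_warps // warps_per_block
    limit_by_regs = regs_per_sm // (regs_per_thread * warps_per_block * warp_size)
    # A block that uses no shared memory is not limited by it.
    limit_by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks
    blocks = min(max_blocks, limit_by_warps, limit_by_regs, limit_by_smem)
    return blocks * warps_per_block


print(theoretical_warps_per_sm(256, regs_per_thread=32, smem_per_block=0))
print(theoretical_warps_per_sm(256, regs_per_thread=64, smem_per_block=0))
print(theoretical_warps_per_sm(256, regs_per_thread=32, smem_per_block=16384))
```

Under these assumed limits, doubling register use from 32 to 64 per thread halves the result from 64 to 32 warps, and 16 KB of shared memory per block caps it at 24 warps. This is the kind of tradeoff the Achieved Occupancy experiment helps to explore.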