How do warp IDs affect the performance of CUDA kernels?

I know that thread blocks are scheduled as warps by the warp schedulers of each SM.
My question is:
a) Does the warp execution order stay the same with every run of the same kernel?
If, say, warp0, warp1, warp2, and warp3 are colocated on SM0 and the execution order is warp0, warp3, warp2, warp0, warp3, warp1…, will the order always be the same when running this kernel?
b) Additionally, do the IDs of the warps that are colocated on a specific SM affect the kernel’s performance?
Say SM0 is occupied by warp0, warp1, warp2, and warp3. Would the performance be different if SM0 were instead occupied by warp0, warp1, warp9, and warp10? In both cases 4 warps occupy SM0, but does the ID of each warp (and consequently the data each warp accesses) affect the behavior of the warp scheduler and the performance of the kernel as a whole?

Thank you in advance!

a) CUDA doesn’t provide any guarantees of this that I am aware of.
b) Possibly. It would be code-dependent to some degree.

Hi Robert, thank you for your reply.

Could you please clarify your answer to (b) a little bit more?
What could make the difference?
Is it only a matter of the memory access pattern?

Thread blocks are rasterized into warps (32 threads), and warps are launched on SMSPs (SM sub-partitions == warp schedulers).


  • The programming model provides no guarantee regarding the assignment of thread blocks to SMs or warps to SM sub-partitions (warp schedulers).
  • The programming model does guarantee that all threads in a thread block will be co-resident on the same SM.
  • There is no guarantee regarding the order of execution of warps. Scheduling order is not influenced by the warp ID.
  • On most GPUs the lower 2 bits of the warp ID indicate the SM sub-partition.
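The mapping described above can be sketched numerically. This is a minimal illustration, assuming a block whose threads are linearized in the usual x-then-y-then-z order, warps formed from consecutive groups of 32 linear thread IDs, and 4 sub-partitions per SM selected by the lower 2 bits of the warp ID (the function names here are illustrative, not CUDA APIs):

```python
WARP_SIZE = 32
NUM_SMSP = 4  # typical number of sub-partitions per SM on recent architectures

def warp_id(tid_x, tid_y=0, tid_z=0, block_dim=(256, 1, 1)):
    """Linearize the thread index within the block, then divide by the warp size."""
    bx, by, _ = block_dim
    linear = tid_x + tid_y * bx + tid_z * bx * by
    return linear // WARP_SIZE

def smsp(warp):
    """Lower 2 bits of the warp ID select the SM sub-partition."""
    return warp & (NUM_SMSP - 1)

# A 256-thread 1-D block yields warps 0..7, spread across sub-partitions 0..3.
warps = sorted({warp_id(t) for t in range(256)})
print(warps)                      # [0, 1, 2, 3, 4, 5, 6, 7]
print([smsp(w) for w in warps])   # [0, 1, 2, 3, 0, 1, 2, 3]
```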


  • Yes. The co-location of thread blocks on SMs and assignment of warps to SMSP can impact kernel performance as each warp contends for shared resources including instruction issue slots, instruction pipelines, and cache accesses.
  • In terms of warps on an individual SM, the goal is to have an equal number of warps per SMSP. The CUDA profilers collect useful statistics per SM and per SMSP so you can determine whether there is a balance issue.
  • The CUDA API does not provide any controls regarding the assignment of work to SMs or of warps to warp schedulers. The MPS server provides some control at a higher level.
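The balance point above can be illustrated with a toy model, assuming warps land on sub-partition warp_id % 4: with 8 resident warps every scheduler holds 2 warps, but with 6 resident warps two schedulers hold 2 and two hold only 1, so issue bandwidth is used unevenly (the `smsp_load` helper here is illustrative, not a CUDA API):

```python
from collections import Counter

NUM_SMSP = 4  # assumed: 4 warp schedulers (sub-partitions) per SM

def smsp_load(num_warps):
    """Count resident warps per sub-partition under a warp_id % 4 assignment."""
    return Counter(w % NUM_SMSP for w in range(num_warps))

print(smsp_load(8))  # balanced: 2 warps on each of the 4 sub-partitions
print(smsp_load(6))  # imbalanced: sub-partitions 0 and 1 get 2; 2 and 3 get 1
```

The per-SMSP metrics in the profilers expose exactly this kind of skew on real hardware.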