How do warp IDs affect the performance of CUDA kernels

Hello,
I know that thread blocks are scheduled as warps by the warp schedulers of each SM.
My question is:
a) Does the execution order of the warps stay the same with every run of the same kernel?
If, let’s say, warp0, warp1, warp2, and warp3 are co-located on SM0 and the order of running warps is warp0, warp3, warp2, warp0, warp3, warp1…, will the order always be the same when running this kernel?
b) Additionally, do the IDs of the warps that are co-located on a specific SM affect the kernel’s performance?
Let’s say SM0 is occupied by warp0, warp1, warp2, and warp3. Would the performance be different if SM0 were instead occupied by warp0, warp1, warp9, and warp10? In both cases four warps would occupy SM0, but does the ID of each warp (and consequently the data that each warp accesses) affect the performance of the warp scheduler and of the kernel as a whole?

Thank you in advance!

a) CUDA doesn’t provide any guarantees of this that I am aware of.
b) Possibly. It would be code-dependent to some degree.

Hi Robert, thank you for your reply.

Could you please clarify your answer (b) a little bit more?
What could make the difference?
Is it only a matter of the memory access pattern?

Thread blocks are rasterized into warps (32 threads each), and warps are launched on SMSPs (SM sub-partitions == warp schedulers).
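
As a concrete illustration, here is a minimal sketch (the kernel name is my own) of how a block’s threads map onto warps: the threads are linearized and split into consecutive groups of warpSize (32).

```
#include <cstdio>

__global__ void show_warp_rasterization()
{
    // Linear thread index within the block (1-D launch assumed).
    int linear_tid    = threadIdx.x;
    int warp_in_block = linear_tid / warpSize;  // warpSize == 32 on current GPUs
    int lane          = linear_tid % warpSize;  // position within the warp

    if (lane == 0)
        printf("block %u: threads %d..%d form warp %d of this block\n",
               blockIdx.x,
               warp_in_block * warpSize,
               warp_in_block * warpSize + warpSize - 1,
               warp_in_block);
}

int main()
{
    show_warp_rasterization<<<2, 128>>>();  // 128 threads -> 4 warps per block
    cudaDeviceSynchronize();
    return 0;
}
```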

a.

  • The programming model provides no guarantee regarding the assignment of thread blocks to SMs or warps to SM sub-partitions (warp schedulers).
  • The programming model does guarantee that all threads in a thread block will be co-resident on the same SM.
  • There is no guarantee regarding the order of execution of warps. Scheduling order is not influenced by the warp ID.
  • On most GPUs the lower 2 bits of the hardware warp ID indicate the SM sub-partition (see the sketch just below).
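
If you want to observe this placement yourself, here is a small sketch that reads the %smid and %warpid PTX special registers via inline PTX (the helper names are my own). Keep in mind that %warpid is the hardware warp slot and is documented as volatile (it can change if the warp is rescheduled), so this is for observation only, never for indexing data.

```
#include <cstdio>

__device__ unsigned int get_smid()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__device__ unsigned int get_warpid()
{
    unsigned int wid;
    asm volatile("mov.u32 %0, %%warpid;" : "=r"(wid));
    return wid;
}

__global__ void show_placement()
{
    if (threadIdx.x % warpSize == 0) {
        unsigned int wid = get_warpid();
        // On most GPUs the low 2 bits of the hardware warp slot select the
        // SM sub-partition (warp scheduler), per the note above.
        printf("block %u, warp-in-block %u -> SM %u, hw warp slot %u, SMSP %u\n",
               blockIdx.x, threadIdx.x / warpSize,
               get_smid(), wid, wid & 3u);
    }
}

int main()
{
    show_placement<<<4, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}
```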

b.

  • Yes. The co-location of thread blocks on SMs and the assignment of warps to SMSPs can impact kernel performance, as each warp contends for shared resources including instruction issue slots, instruction pipelines, and cache accesses (the sketch after this list illustrates the data-access dependence).
  • In terms of warps on an individual SM, the goal is to have an equal number of warps per SMSP. The CUDA profilers collect useful statistics per SM and per SMSP, so you can determine if there is a balance issue.
  • The CUDA API does not provide any controls regarding the assignment of work to SMs or warps to warp schedulers. The MPS server provides some control at a higher level.
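
To make the “code dependent” part concrete, below is a hypothetical kernel in which the data each warp touches is a pure function of its software warp index. With code like this, the set of warps that happen to be co-resident on an SM determines how much their working sets overlap in that SM’s L1 (and in L2), which is exactly the kind of effect the per-SM/per-SMSP profiler counters can surface.

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each warp sums its own slice of the input, so the
// slice a warp reads is determined entirely by its software warp index.
__global__ void warp_sliced_sum(const float *in, float *out, int slice, int n)
{
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int lane = threadIdx.x % warpSize;

    float acc = 0.0f;
    for (int i = lane; i < slice; i += warpSize) {
        int idx = warp * slice + i;
        if (idx < n)
            acc += in[idx];
    }

    // Warp-level reduction of the per-lane partial sums.
    for (int off = warpSize / 2; off > 0; off >>= 1)
        acc += __shfl_down_sync(0xffffffffu, acc, off);

    if (lane == 0)
        out[warp] = acc;
}

int main()
{
    const int slice = 1 << 10;          // elements per warp
    const int warps = 64;
    const int n     = slice * warps;

    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, warps * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    const int threads = 256;                 // 8 warps per block
    const int blocks  = warps * 32 / threads;
    warp_sliced_sum<<<blocks, threads>>>(in, out, slice, n);
    cudaDeviceSynchronize();

    printf("warp 0 sum = %f (expected %d)\n", out[0], slice);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Whether warps with adjacent or distant indices end up co-located would change the cache reuse for a kernel like this, but since you cannot control that placement, the practical levers are the access pattern itself and the block size.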