According to page 11 of https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf, there are four partitions in each SM. My understanding is that each partition is mapped to one warp at a time, so four warps can issue concurrently. Is it possible in CUDA to specify different behavior for the warp on a specific partition? For example, I want all and only the threads from the first partition to load data from global memory and cache it in L1 (using the .ca modifier), while threads from the other partitions do not update L1 (using the .lu modifier). How can I achieve this?
No, there is no way to direct operations at this level. Greg’s SM overview gives a good explanation of how this works: the warp issuing on a given sub-partition can change on a cycle-by-cycle basis.
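To illustrate the distinction: the cache operator is chosen per load *instruction*, so you can vary it per warp in software, but which hardware sub-partition a warp lands on is decided by the scheduler and is not exposed to CUDA. A minimal sketch, assuming the `__ldca`/`__ldlu` load intrinsics (available on sm_35 and later) and a hypothetical kernel name of my own:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example: select the cache policy per software warp index.
// Note the warp index is a logical ID within the block; the mapping of
// warps to SM sub-partitions is controlled by hardware, not the program.
__global__ void per_warp_cache_policy(const float* __restrict__ in,
                                      float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int warp_in_block = threadIdx.x / warpSize;

    float v;
    if (warp_in_block == 0) {
        v = __ldca(&in[i]);   // compiles to ld.global.ca (cache at all levels)
    } else {
        v = __ldlu(&in[i]);   // compiles to ld.global.lu (last-use hint)
    }
    out[i] = v * 2.0f;
}

int main() {
    const int n = 256;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    per_warp_cache_policy<<<1, n>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

This specializes by *warp*, which is the finest granularity the programming model offers; there is no register or intrinsic that reports or pins the sub-partition.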
To what end would you choose to specialize work on a partition?
I ask because specialization per SM sub-partition (SMSP in profiler terms) would almost always result in lower throughput. Warp specialization, as used by the CUTLASS library, specializes the work done by specific warps in a thread block, with the ultimate goal of having equal specialized work on each SM sub-partition. This allows the fewest warps to use the most resources while optimizing throughput, reducing communication overhead, and increasing determinism.
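A hedged sketch of the warp-specialization pattern described above (the kernel name and tile size are my own, not CUTLASS's): one producer warp stages a tile from global to shared memory while the remaining warps consume it. Again, the hardware decides which sub-partition each warp executes on.

```cuda
#include <cuda_runtime.h>

// Hypothetical producer/consumer warp specialization.
// Assumes a launch with blockDim.x == 256 (8 warps of 32 threads).
__global__ void specialized_kernel(const float* __restrict__ in,
                                   float* __restrict__ out)
{
    constexpr int TILE = 256;
    __shared__ float tile[TILE];

    int warp = threadIdx.x / warpSize;
    int lane = threadIdx.x % warpSize;
    int num_consumer_warps = blockDim.x / warpSize - 1;

    if (warp == 0) {
        // Producer: warp 0 stages the whole tile into shared memory.
        for (int j = lane; j < TILE; j += warpSize)
            tile[j] = in[blockIdx.x * TILE + j];
    }
    __syncthreads();  // all warps wait until the tile is fully staged

    if (warp != 0) {
        // Consumers: warps 1..N-1 stride over the tile and do the math.
        for (int j = (warp - 1) * warpSize + lane; j < TILE;
             j += num_consumer_warps * warpSize)
            out[blockIdx.x * TILE + j] = tile[j] * 2.0f;
    }
}
```

Real implementations pipeline this with double buffering and named barriers (or, on Hopper, async barriers and TMA) so producers and consumers overlap; this sketch only shows the role split.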
Thank you for the reply and the good information. I don’t have a real use case; I was just studying the document I linked and tried to create some tests for a better understanding of the architecture.