How to program different behaviors for the 4 partitions in 1 SM in the Ada architecture?

According to page 11 of https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf, there are four partitions in each SM. My understanding is that each partition issues from one warp at a time, so up to four warps can issue concurrently. Is it possible in CUDA to specify different behavior for warps on a specific partition? For example, I want all and only threads from the first partition to load data from global memory and cache it in L1 (using the .ca modifier), while threads from the other partitions do not update L1 (using the .lu modifier). How can I achieve this?
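For concreteness, the two load flavors I mean can be written with CUDA's cache-hint load intrinsics (this is just a sketch of the two behaviors, not the per-partition selection I'm asking about):

```cuda
// Sketch: the two global-load flavors from the question.
// __ldca maps to ld.global.ca (cache at all levels, including L1);
// __ldlu maps to ld.global.lu (last-use, avoids keeping the line in L1).
// Which flavor is used is fixed per instruction at compile time --
// it cannot be selected per SM sub-partition.
__global__ void load_variants(const float* __restrict__ in,
                              float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float a = __ldca(&in[i]);  // load, cache in L1
    float b = __ldlu(&in[i]);  // load, last-use hint
    out[i] = a + b;
}
```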

Thanks

No, there is no way to direct operations at this level. Greg’s SM overview gives a good explanation of how this works: the warp a partition issues from can change on a cycle-by-cycle basis.

To what end would you choose to specialize work on a partition?

I ask because specialization per SM sub-partition (SMSP in profiler terms) will almost always result in lower throughput. Warp specialization, as used by the CUTLASS library, specializes the work done by specific warps in a thread block, with the ultimate goal of having equal specialized work on each SM sub-partition. This allows the fewest number of warps to use the most resources while optimizing throughput, reducing communication overhead, and increasing determinism.
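A minimal warp-specialization sketch in the spirit described above (hypothetical illustration, not CUTLASS code): different warps within one thread block take different roles, and the hardware decides which sub-partition each warp runs on.

```cuda
// Warp specialization sketch: warp 0 of each block acts as a producer
// that stages a tile into shared memory; the other warps consume it.
// Roles are assigned per warp, not per sub-partition.
__global__ void produce_consume(const float* __restrict__ in,
                                float* __restrict__ out, int n)
{
    __shared__ float tile[32];
    int warp = threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    int idx  = blockIdx.x * 32 + lane;

    if (warp == 0) {
        // producer warp: stage one tile of input into shared memory
        tile[lane] = (idx < n) ? in[idx] : 0.0f;
    }
    __syncthreads();  // hand the tile off to the consumer warps

    if (warp != 0 && idx < n) {
        // consumer warps: compute on the staged tile
        out[idx] = tile[lane] * 2.0f;
    }
}
```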

You can try to detect the SM partition and then call a different function at the top level of your kernel. (I think on most GPUs it is `%warpid % 4`.)
https://docs.nvidia.com/cuda/parallel-thread-execution/#special-registers-warpid
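A sketch of that idea, assuming the `%warpid % 4` mapping holds (note the caveat in the PTX docs: `%warpid` is volatile and reflects where the warp is at the moment it is read, so the value can change if the warp is rescheduled):

```cuda
// Hypothetical sketch: branch on the SM sub-partition a warp currently
// occupies, using the %warpid special register via inline PTX.
// Assumption: sub-partition == %warpid % 4 (not guaranteed by NVIDIA).
__global__ void partition_specialized(float* data, int n)
{
    unsigned wid;
    asm volatile("mov.u32 %0, %%warpid;" : "=r"(wid));
    unsigned partition = wid % 4;  // assumed 4 sub-partitions per SM

    if (partition == 0) {
        // ... work variant intended for sub-partition 0 ...
    } else {
        // ... work variant for the other sub-partitions ...
    }
}
```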

Each warp on the SM is assigned to one of the 4 partitions, and each cycle the scheduler of a partition selects one eligible warp from those assigned to it.

Thank you for the reply and good information. I don’t have any real use case. I was just studying the document I linked and tried to create some tests for better understanding of the architecture.

Thank you! This overview is very helpful.