According to page 11 of https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf, there are four partitions in each SM. My understanding is that each partition is mapped to one warp at a time, so four warps can issue concurrently. Is it possible in CUDA to specify different behavior for the warp on a specific partition? For example, I want all and only the threads from the first partition to load data from global memory and cache it in L1 (using the .ca modifier), while threads from the other partitions do not update L1 (using the .lu modifier). How can I achieve this?
No, there is no way to direct operations at this level. Greg’s SM overview gives a good explanation of how this works: the warp issuing on a given sub-partition can change on a cycle-by-cycle basis.
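To illustrate the distinction: the cache operator is chosen per load *instruction*, so you can vary it per warp in software, but which hardware sub-partition a warp lands on is decided by the scheduler and is not exposed to CUDA. A minimal sketch, assuming the `__ldca`/`__ldlu` load intrinsics (available on sm_35 and later) and a hypothetical kernel name of my own:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example: select the cache policy per software warp index.
// Note the warp index is a logical ID within the block; the mapping of
// warps to SM sub-partitions is controlled by hardware, not the program.
__global__ void per_warp_cache_policy(const float* __restrict__ in,
                                      float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int warp_in_block = threadIdx.x / warpSize;

    float v;
    if (warp_in_block == 0) {
        v = __ldca(&in[i]);   // compiles to ld.global.ca (cache at all levels)
    } else {
        v = __ldlu(&in[i]);   // compiles to ld.global.lu (last-use hint)
    }
    out[i] = v * 2.0f;
}

int main() {
    const int n = 256;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    per_warp_cache_policy<<<1, n>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

This specializes by *warp*, which is the finest granularity the programming model offers; there is no register or intrinsic that reports or pins the sub-partition.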
To what end would you choose to specialize work on a partition?
I ask because specialization per SM sub-partition (SMSP in profiler terms) would almost always result in lower throughput. Warp specialization, as used by the CUTLASS library, specializes the work done by specific warps in a thread block, with the ultimate goal of having equal specialized work on each SM sub-partition. This allows the fewest warps to use the most resources while optimizing throughput, reducing communication overhead, and increasing determinism.
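A hedged sketch of the warp-specialization pattern described above (the kernel name and tile size are my own, not CUTLASS's): one producer warp stages a tile from global to shared memory while the remaining warps consume it. Again, the hardware decides which sub-partition each warp executes on.

```cuda
#include <cuda_runtime.h>

// Hypothetical producer/consumer warp specialization.
// Assumes a launch with blockDim.x == 256 (8 warps of 32 threads).
__global__ void specialized_kernel(const float* __restrict__ in,
                                   float* __restrict__ out)
{
    constexpr int TILE = 256;
    __shared__ float tile[TILE];

    int warp = threadIdx.x / warpSize;
    int lane = threadIdx.x % warpSize;
    int num_consumer_warps = blockDim.x / warpSize - 1;

    if (warp == 0) {
        // Producer: warp 0 stages the whole tile into shared memory.
        for (int j = lane; j < TILE; j += warpSize)
            tile[j] = in[blockIdx.x * TILE + j];
    }
    __syncthreads();  // all warps wait until the tile is fully staged

    if (warp != 0) {
        // Consumers: warps 1..N-1 stride over the tile and do the math.
        for (int j = (warp - 1) * warpSize + lane; j < TILE;
             j += num_consumer_warps * warpSize)
            out[blockIdx.x * TILE + j] = tile[j] * 2.0f;
    }
}
```

Real implementations pipeline this with double buffering and named barriers (or, on Hopper, async barriers and TMA) so producers and consumers overlap; this sketch only shows the role split.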
Thank you for the reply and the good information. I don’t have a real use case; I was just studying the document I linked and tried to create some tests for a better understanding of the architecture.