I noticed that the libcudacxx library implements the cuda::access_property extensions, and that the corresponding PTX instructions are also defined.
Users can create different L2 eviction policies and attach them to memory access instructions such as ld, st, and cp.async to get finer-grained control over cache behavior.
The policies come in three modes: range-based, fraction-based, and one compatible with the L2 persistence cache. The granularity of access control over memory blocks can range from 512KB to 4GB.
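For reference, this is roughly how I understand the range-based and fraction-based modes are used through cuda::annotated_ptr. It is a minimal sketch, assuming CUDA 11.x or later and an sm_80+ GPU (on older GPUs the hints should simply have no effect). The kernel apply_table, the helper run_demo, the 0.25 fraction, and the choice of which buffer is "streaming" versus "persisting" are all hypothetical, picked only for illustration.

```cuda
#include <cuda/annotated_ptr>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical kernel: out[i] = in[i] * table[i % table_len].
__global__ void apply_table(float* in, float* out, float* table,
                            int n, int table_len) {
  // `in` is read exactly once: tag every access as streaming (evict-first).
  cuda::annotated_ptr<float, cuda::access_property::streaming> in_p{in};

  // Range-based policy: the first hit_bytes of the window starting at `table`
  // get the persisting property, the rest of the window gets streaming.
  // Here the hit region covers the whole (small, heavily reused) table.
  const std::size_t table_bytes = table_len * sizeof(float);
  cuda::access_property table_prop{table, table_bytes, table_bytes,
                                   cuda::access_property::persisting{},
                                   cuda::access_property::streaming{}};
  cuda::annotated_ptr<float, cuda::access_property> table_p{table, table_prop};

  // Fraction-based policy: a fraction (here 0.25) of the accessed lines get the
  // persisting property, e.g. if a later kernel re-reads part of `out`.
  cuda::access_property out_prop{cuda::access_property::persisting{}, 0.25f};
  cuda::annotated_ptr<float, cuda::access_property> out_p{out, out_prop};

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out_p[i] = in_p[i] * table_p[i % table_len];
}

// Returns 0 if the kernel produced the expected values, 1 otherwise.
int run_demo() {
  const int n = 1 << 20;
  const int table_len = 1024;
  std::vector<float> h_in(n, 2.0f), h_table(table_len), h_out(n, 0.0f);
  for (int k = 0; k < table_len; ++k) h_table[k] = static_cast<float>(k);

  // Persisting lines only stay resident if an L2 set-aside exists (sm_80+).
  // Reserve the maximum allowed; on older GPUs the size is 0 and we skip it.
  cudaDeviceProp prop{};
  if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess &&
      prop.persistingL2CacheMaxSize > 0) {
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize,
                       prop.persistingL2CacheMaxSize);
  }
  cudaGetLastError();  // clear any non-sticky error from the calls above

  float *in = nullptr, *out = nullptr, *table = nullptr;
  cudaMalloc(&in, n * sizeof(float));
  cudaMalloc(&out, n * sizeof(float));
  cudaMalloc(&table, table_len * sizeof(float));
  cudaMemcpy(in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(table, h_table.data(), table_len * sizeof(float),
             cudaMemcpyHostToDevice);

  apply_table<<<(n + 255) / 256, 256>>>(in, out, table, n, table_len);
  cudaMemcpy(h_out.data(), out, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(in);
  cudaFree(out);
  cudaFree(table);

  // Cache hints must not change results, only (possibly) performance.
  for (int k = 0; k < n; ++k)
    if (h_out[k] != 2.0f * static_cast<float>(k % table_len)) return 1;
  return 0;
}

int main() {
  int rc = run_demo();
  std::printf("%s\n", rc == 0 ? "ok" : "mismatch");
  return rc;
}
```

The hints only change eviction priority, so the output is identical with or without them; whether they actually help depends on the access pattern, which is exactly what I am asking about.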
However, I have not yet seen this flexibility used in real programs. In which applications can this kind of cache control have a big performance impact? I assume the GPU architecture and the PTX ISA were designed with specific use cases in mind.