How can I use the createpolicy PTX instruction effectively in CUDA? Are there any practical examples for reference?

Hello everyone,

I noticed that libcudacxx implements cuda::access_property as an extension, and that the corresponding PTX instructions are defined as well.
Users can create different L2 eviction policies and apply them to memory access instructions such as ld, st, and cp.async to control cache behavior at a finer granularity.
The policies come in three modes: range-based, fraction-based, and a mode compatible with the L2 persistence cache. The granularity of access control for memory blocks can range from 512KB to 4GB.
However, I have not yet seen any programs that make use of this flexibility. In which applications can this kind of usage have a significant performance impact? I assume there were specific use cases in mind when the GPU architecture and the PTX ISA were designed.

Thank you!

The A100 whitepaper indicates possible use cases (on pp. 40-41):

For example, for DL inferencing workloads, ping-pong buffers can be persistently cached in the
L2 for faster data access, while also avoiding writebacks to DRAM. For producer-consumer
chains, such as those found in DL training, L2 cache controls can optimize caching across the
write-to-read data dependencies. In LSTM networks, recurrent weights that are shared across
multiple GEMM operations can be preferentially cached and reused in L2.

On page 66 there is a performance comparison for a simple test case (histogramming).
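
To make the "preferentially cache the reused data" idea a bit more concrete, here is a minimal sketch using cuda::annotated_ptr from libcudacxx. It is only an illustration under assumptions, not a tuned implementation: the kernel scale_by_weights, the helper run_scale_example, and all sizes are hypothetical names and values I made up, and I am assuming a compute capability 8.0+ GPU and a persisting L2 set-aside for the hint to have any effect. The properties are performance hints only, so the results are the same with or without them.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cuda/annotated_ptr>

// Hypothetical kernel: each thread reads one element of a large input once
// and multiplies it by an entry of a small weight table that all blocks reuse.
__global__ void scale_by_weights(float* weights, int n_weights,
                                 float* input, float* output, int n) {
    // Reused table: request the "persisting" L2 access property.
    cuda::annotated_ptr<float, cuda::access_property::persisting> w{weights};
    // Data touched once: request the "streaming" access property.
    cuda::annotated_ptr<float, cuda::access_property::streaming> in{input};
    cuda::annotated_ptr<float, cuda::access_property::streaming> out{output};

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * w[i % n_weights];
    }
}

// Runs the kernel once and returns the number of mismatching outputs
// (0 on success), or -1 if a CUDA call fails (e.g. no usable GPU).
int run_scale_example() {
    const int n = 1 << 20, n_weights = 1024;
    std::vector<float> h_w(n_weights), h_in(n), h_out(n);
    for (int i = 0; i < n_weights; ++i) h_w[i] = 1.0f + float(i % 7);
    for (int i = 0; i < n; ++i) h_in[i] = float(i % 13);

    float *d_w = nullptr, *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_w, n_weights * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(d_w, h_w.data(), n_weights * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Assumption: a persisting L2 carve-out is needed for the hint to matter.
    // The call fails on devices without this feature, so the error is cleared.
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, n_weights * sizeof(float));
    cudaGetLastError();

    const int threads = 256;
    scale_by_weights<<<(n + threads - 1) / threads, threads>>>(
        d_w, n_weights, d_in, d_out, n);
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against the same computation on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != h_in[i] * h_w[i % n_weights]) ++mismatches;
    }

    cudaFree(d_w);
    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches;
}
```

Calling run_scale_example() from main and comparing timings with and without the annotated pointers would be the way to see whether it helps on a given GPU. With a table this small I would not expect a measurable difference; the whitepaper scenarios involve reused buffers that compete with large streaming traffic for L2 capacity.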