Evicting lines from the cache during kernel execution. Possible?


I was wondering if there is a way to invalidate specific cache lines in the L2 cache on the Tegra (TX1, TX2) during kernel execution? Or, if this is not available, is it possible to flush the full L2 cache during kernel execution?

Some background information:

I have a program that is divided into sequential intervals, each consisting of a prefetch phase and an execution phase. During the prefetch phase, all data that is to be used during the current interval is prefetched into the L2 (using the prefetch.global.L2 PTX instruction). After this, there is a synchronization before switching to the execution phase. During the execution phase, I want all data that was previously prefetched to be available in the cache, i.e., I want to experience zero L2 cache misses during the execution phase. In fact, achieving zero cache misses is my main goal, even if there is no immediate performance gain.
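
To make the structure concrete, here is a minimal sketch of one such interval loop in CUDA. It is an illustration under stated assumptions, not my actual program: the kernel name prem_sum, the helper prefetch_l2, the host wrapper run_prem_sum, the 128-byte line size, and the single-block launch are all made up for this example. Only the prefetch.global.L2 instruction comes from the description above. Note also that the prefetch is only a hint, and __syncthreads() does not guarantee that the prefetched lines have actually arrived in the L2.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumed L2 line size in bytes (an assumption, adjust for the target GPU).
constexpr int kLineBytes = 128;

// Issue an L2 prefetch hint for the line containing p.
// Assumes p points to global memory (e.g. from cudaMalloc), so that the
// generic address can be used directly as a global address.
__device__ __forceinline__ void prefetch_l2(const void* p) {
    asm volatile("prefetch.global.L2 [%0];" :: "l"(p));
}

// A single block walks the input in intervals of `interval` floats.
// `interval` is assumed to be a multiple of the line size in floats.
__global__ void prem_sum(const float* in, float* out, int n, int interval) {
    const int floatsPerLine = kLineBytes / sizeof(float);
    float acc = 0.0f;
    for (int base = 0; base < n; base += interval) {
        const int end = min(base + interval, n);

        // Prefetch phase: one hint per cache line of this interval.
        for (int i = base + threadIdx.x * floatsPerLine; i < end;
             i += blockDim.x * floatsPerLine) {
            prefetch_l2(in + i);
        }
        // Block-wide barrier between the phases. This orders the threads,
        // but does not wait for the prefetches to complete.
        __syncthreads();

        // Execution phase: consume the data of this interval.
        for (int i = base + threadIdx.x; i < end; i += blockDim.x) {
            acc += in[i];
        }
        __syncthreads();
    }
    atomicAdd(out, acc);
}

// Hypothetical host wrapper: sums n ones on the GPU and returns the result.
float run_prem_sum(int n, int interval) {
    std::vector<float> host(n, 1.0f);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    prem_sum<<<1, 256>>>(d_in, d_out, n, interval);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling run_prem_sum(1 << 16, 4096) from a host program runs 16 intervals of 16 KiB each. The sketch says nothing about which lines stay resident in the L2; that is exactly the part I am asking about.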

However, due to the non-LRU replacement policy of the L2 cache (according to Mei [1]), I expect some of the prefetched data to be evicted before it can be used: a prefetched line may be picked more or less at random as the victim to make room for another prefetched line, already during the prefetch phase.

This effect has previously been identified on the ARM caches of the Tegra TX1, which also have a non-LRU replacement policy. However, as outlined by (among others) Matějka [2], the ARM caches fill invalid cache ways first, before evicting any valid cache lines. Because of this, it is possible to reduce the evictions of prefetched data by first invalidating the cache lines that are no longer needed, making sure that there are enough invalid lines for the next prefetch phase to use.

I am not sure whether it is possible to achieve something similar on the GPU of the Tegra TX1/TX2, but the first questions I need answers to are:

  1. Does the GPU cache similarly prioritize the use of invalid cache ways when loading new data into the cache?
  2. In that case, is it possible to invalidate individual cache lines (in the L2) during kernel execution?
  3. If not, is it possible to flush the entire L2 cache during kernel execution?

Thanks a lot in advance for any answers or pointers regarding these questions.

[1] Xinxin Mei et al., “Dissecting GPU Memory Hierarchy through Microbenchmarking”, https://arxiv.org/abs/1509.02308
[2] Joel Matějka et al., “Combining PREM compilation and ILP scheduling for high-performance and predictable MPSoC execution”, https://dl.acm.org/citation.cfm?id=3178444


TX1 and TX2 are ARMv8 CPU implementations, so whatever applies to ARM also applies to TX1 and TX2.
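To invalidate a cache line, you can try using __inval_cache_range.
Follow it with a dsb in your code: the barrier does not flush the cache by itself, but it makes sure the cache maintenance has completed before execution continues.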