Evicting lines from the cache during kernel execution. Possible?

bjoernf · August 29, 2018, 9:40am

Hi,

I was wondering if there is a way to invalidate specific cache lines in the L2 cache on the Tegra (TX1, TX2) during kernel execution? Or, if this is not available, is it possible to flush the full L2 cache during kernel execution?

Some background information:

I have a program that is divided into sequential intervals, each consisting of a prefetch phase followed by an execution phase. During the prefetch phase, all data that will be used in the current interval is prefetched into the L2 (using the prefetch.global.L2 PTX instruction). After this, there is a synchronization before switching to the execution phase. During the execution phase, I want all previously prefetched data to still be resident in the L2, i.e., I want to see zero L2 cache misses during the execution phase. In fact, zero cache misses is my main goal, even if there is no immediate performance gain.
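For illustration, one interval of this scheme could be sketched as below. This is only a minimal sketch under stated assumptions, not my actual program: the names prefetch_l2, prefetch_phase, execute_phase and run_interval are made up for this example, the 128-byte prefetch granularity is an assumed value, and the kernel boundary stands in for the synchronization between the two phases. Nothing in it guarantees that the prefetched lines are still in the L2 when the execution phase runs, which is exactly what I am asking about.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// ASSUMPTION: prefetch granularity of 128 bytes (illustrative value only).
constexpr size_t kAssumedLineBytes = 128;

// Hypothetical helper: convert the generic pointer to a global-space address
// and issue the prefetch.global.L2 PTX instruction for it.
__device__ __forceinline__ void prefetch_l2(const void* p) {
    asm volatile(
        "{\n\t"
        ".reg .u64 gaddr;\n\t"
        "cvta.to.global.u64 gaddr, %0;\n\t"
        "prefetch.global.L2 [gaddr];\n\t"
        "}"
        :: "l"(p));
}

// Prefetch phase: one thread per assumed line of the interval's input data.
__global__ void prefetch_phase(const float* data, size_t n, size_t floats_per_line) {
    size_t line = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t idx = line * floats_per_line;
    if (idx < n) prefetch_l2(data + idx);
}

// Execution phase: placeholder computation that reads the prefetched data.
__global__ void execute_phase(const float* in, float* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Runs one prefetch/execute interval on n floats.
// Returns the number of wrong output elements (0 on success), or -1 on a CUDA error.
int run_interval(size_t n) {
    std::vector<float> h_in(n), h_out(n);
    for (size_t i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 100);

    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const size_t floats_per_line = kAssumedLineBytes / sizeof(float);
    const size_t lines = (n + floats_per_line - 1) / floats_per_line;
    const unsigned threads = 256;

    // Both kernels go to the default stream, so the execution phase starts
    // only after the prefetch phase has completed (the synchronization point).
    prefetch_phase<<<(unsigned)((lines + threads - 1) / threads), threads>>>(d_in, n, floats_per_line);
    execute_phase<<<(unsigned)((n + threads - 1) / threads), threads>>>(d_in, d_out, n);

    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (size_t i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * h_in[i]) ++wrong;
    return wrong;
}
```

Calling run_interval(4096) from a host main should return 0 if both phases ran. The result only checks functional correctness; measuring the actual L2 hit rate of the execution phase would need a profiler.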

However, because the L2 cache uses a non-LRU replacement policy (according to Mei et al. [1]), I expect some of the prefetched data to be evicted before it can be used: a line that has already been prefetched may be randomly chosen for eviction to make room for another prefetched line while the prefetch phase is still running.

This effect has previously been identified on the ARM caches on the Tegra TX1, which also have a non-LRU replacement policy. However, as outlined by (among others) Matějka et al. [2], the ARM caches fill invalid cache ways first, before evicting any valid cache lines. Because of this, it is possible to reduce evictions of prefetched data by first invalidating the cache lines that are no longer needed, making sure that there are enough invalid lines for the next prefetch phase to use.

I am not sure it is possible to achieve something similar on the Tegra TX1/TX2, but the first questions I need to find the answers to are:

1. Does the GPU cache similarly prioritize the use of invalid cache ways when loading new data into the cache?
2. In that case, is it possible to invalidate individual cache lines (in the L2) during kernel execution?
3. If not, is it possible to flush the entire L2 cache during kernel execution?

Thanks a lot in advance for any answers or pointers regarding these questions.
Bjoern

[1] Xinxin Mei et al., “Dissecting GPU Memory Hierarchy through Microbenchmarking”, arXiv:1509.02308
[2] Joel Matějka et al., “Combining PREM compilation and ILP scheduling for high-performance and predictable MPSoC execution”, https://dl.acm.org/citation.cfm?id=3178444

Bibek · September 3, 2018, 11:04am

Hi,

TX1 and TX2 are ARMv8 CPU implementations, so whatever applies to ARM also applies to TX1 and TX2.
To invalidate a cache line, you can try using __inval_cache_range.
dsb should flush the cache in your code.

thanks
Bibek
