Cache data invalidation between kernel calls

bowuwm · June 6, 2012, 4:37pm

Hi,

I guess cached data is invalidated after each kernel invocation finishes. The reason would be that the GPU doesn’t know whether values in main memory were changed by the CPU in the meantime. Is this right? If so, is there any way to change this behavior?

Gregory_Diamos · June 6, 2012, 5:04pm

You should be able to issue an uncached ld/st followed by a system-wide memory fence to make accesses by an SM avoid the cache and become visible to the CPU.

See __threadfence_system() and the ‘volatile’ keyword in the programming guide for more info.

You can also control this on a finer granularity with inline assembly.
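A minimal sketch of the `volatile` plus `__threadfence_system()` pattern mentioned above, assuming a device that supports mapped (zero-copy) pinned host memory. The names `publish` and `run_publish_demo` are hypothetical, and the host-side polling loop is only one possible way to observe the writes while the kernel may still be running.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one thread publishes a value, then raises a flag.
__global__ void publish(volatile int *data, volatile int *flag)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *data = 42;              // volatile: the compiler must issue a real store
        __threadfence_system();  // order the data store before the flag store,
                                 // as observed by the host and other devices
        *flag = 1;
    }
}

// Returns the value the host reads once the flag is set, or -1 on a CUDA error.
// Assumes mapped pinned memory is available (older setups may first need
// cudaSetDeviceFlags(cudaDeviceMapHost) before the context is created).
int run_publish_demo()
{
    int *h_buf = nullptr;  // h_buf[0] holds the data, h_buf[1] holds the flag
    if (cudaHostAlloc((void **)&h_buf, 2 * sizeof(int), cudaHostAllocMapped) != cudaSuccess)
        return -1;
    h_buf[0] = 0;
    h_buf[1] = 0;

    int *d_buf = nullptr;  // device-side alias of the same host memory
    if (cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return -1;
    }

    publish<<<1, 32>>>(d_buf, d_buf + 1);
    if (cudaGetLastError() != cudaSuccess) {
        cudaFreeHost(h_buf);
        return -1;
    }

    // Poll the flag while the kernel may still be running; stop polling
    // once the stream reports that it is no longer busy.
    volatile int *h_flag = h_buf + 1;
    while (*h_flag == 0 && cudaStreamQuery(0) == cudaErrorNotReady) { }

    int result = -1;
    if (cudaDeviceSynchronize() == cudaSuccess && *h_flag == 1)
        result = h_buf[0];
    cudaFreeHost(h_buf);
    return result;
}
```

Calling `run_publish_demo()` from `main` is enough to try it. The fence only orders the two stores; it does not by itself force anything out of a cache, which is why the pointers are also declared `volatile`.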

bowuwm · June 6, 2012, 5:46pm

Thanks for the reply. I’d like the data to remain in cache between kernel calls rather than use uncached loads.

Gregory_Diamos · June 7, 2012, 12:51am

Sorry, I think I misread your question. I thought you were asking how to make CPU writes visible to the GPU before a kernel finishes.

Hopefully this text from the PTX manual about the default cache policy answers your actual question:

“Cache at all levels, likely to be accessed again.
The default load instruction cache operation is ld.ca, which allocates cache lines in all levels (L1
and L2) with normal eviction policy. Global data is coherent at the L2 level, but multiple L1
caches are not coherent for global data. If one thread stores to global memory via one L1 cache,
and a second thread loads that address via a second L1 cache with ld.ca, the second thread may
get stale L1 cache data, rather than the data stored by the first thread. The driver must
invalidate global L1 cache lines between dependent grids of parallel threads. Stores by the first
grid program are then correctly fetched by the second grid program issuing default ld.ca loads
cached in L1.”

So only the L1s (not the L2) should be invalidated between dependent kernels. Also note that the
L1s are write-through by default for global data:

“The default store instruction cache operation is st.wb, which writes back cache lines of coherent
cache levels with normal eviction policy. Data stored to local per-thread memory is cached in L1
and L2 with write-back. However, sm_20 does NOT cache global store data in L1 because
multiple L1 caches are not coherent for global data. Global stores bypass L1, and discard any L1
lines that match, regardless of the cache operation. Future GPUs may have globally-coherent L1
caches, in which case st.wb could write-back global store data from L1.”

So the L1s are invalidated, but not written back (the L2 already has the most current value for global data,
and local data is dead after the kernel finishes).
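To make the dependent-kernel case concrete, here is a rough sketch: a producer kernel stores to global memory, and a consumer kernel launched afterwards on the same stream reads the data back. The consumer uses inline PTX with the `ld.global.cg` cache operator, which the PTX manual describes as caching at the global level (L2 and below, not L1). The names `load_cg`, `producer`, `consumer`, and `run_dependent_kernels` are hypothetical, and the sketch assumes 64-bit device pointers and `n > 0`. It only demonstrates correctness; whether the consumer actually hits in L2 depends on the hardware and the working-set size.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Load that is cached in L2 but skips L1 (PTX cache operator .cg).
// Assumes 64-bit pointers, hence the "l" constraint.
__device__ int load_cg(const int *p)
{
    int v;
    asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(v) : "l"(p));
    return v;
}

// First grid: writes buf[i] = 2 * i with ordinary global stores.
__global__ void producer(int *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 2 * i;
}

// Second (dependent) grid: reads the producer's output through L2.
__global__ void consumer(const int *buf, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = load_cg(buf + i) + 1;
}

// Returns the number of wrong results (0 if all correct), or -1 on a CUDA error.
int run_dependent_kernels(int n)
{
    int *d_buf = nullptr, *d_out = nullptr;
    if (cudaMalloc((void **)&d_buf, n * sizeof(int)) != cudaSuccess)
        return -1;
    if (cudaMalloc((void **)&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_buf);
        return -1;
    }

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    // Both launches go to the default stream, so the consumer starts only
    // after the producer has finished.
    producer<<<blocks, threads>>>(d_buf, n);
    consumer<<<blocks, threads>>>(d_buf, d_out, n);

    std::vector<int> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    cudaFree(d_out);
    if (err != cudaSuccess)
        return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2 * i + 1) ++mismatches;
    return mismatches;
}
```

Replacing `load_cg(buf + i)` with a plain `buf[i]` gives the default-load variant; per the quoted text, that is also expected to be correct across dependent kernels because the L1 lines are invalidated in between.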

bowuwm · June 7, 2012, 4:18pm

This is very helpful. Thanks!

chenxuhao · August 22, 2013, 5:53pm

I assume you were trying to exploit the temporal locality of the L1 cache.
What kind of application were you investigating?
