Cache data invalidation between kernel calls


I guess cached data is invalidated after each kernel invocation finishes. The reason is that the GPU doesn’t know whether values in main memory have been changed by the CPU. Is this right? If so, is there any way to change this behavior?

You should be able to issue an uncached ld/st followed by a system-wide memory fence to make accesses by an SM avoid the cache and become visible to the CPU.

See __threadfence_system() and the ‘volatile’ keyword in the programming guide for more info.
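
To make that concrete, here is a minimal sketch of the volatile-plus-fence pattern. It assumes zero-copy (mapped pinned) host memory so the CPU can read what the kernel stored. The names `publish` and `run_publish_demo` are made up for illustration, and the host simply waits for the kernel instead of polling while it runs.

```cuda
#include <cuda_runtime.h>

// Producer kernel: write a payload, then raise a flag, both in mapped host memory.
__global__ void publish(volatile int *payload, volatile int *flag)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *payload = 42;           // volatile store: the compiler may not keep it in a register
        __threadfence_system();  // order the payload store before the flag store, system-wide
        *flag = 1;
    }
}

// Hypothetical helper: returns the payload seen by the host,
// -1 on a CUDA error, or -2 if the flag was never observed.
int run_publish_demo()
{
    int *h_buf = nullptr;  // host view of the buffer: [0] = payload, [1] = flag
    int *d_buf = nullptr;  // device view of the same memory
    if (cudaHostAlloc((void **)&h_buf, 2 * sizeof(int), cudaHostAllocMapped) != cudaSuccess)
        return -1;
    h_buf[0] = 0;
    h_buf[1] = 0;
    if (cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return -1;
    }

    publish<<<1, 32>>>(d_buf, d_buf + 1);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFreeHost(h_buf);
        return -1;
    }

    int seen = (h_buf[1] == 1) ? h_buf[0] : -2;
    cudaFreeHost(h_buf);
    return seen;
}
```

Because the host waits on `cudaDeviceSynchronize()` here, it would see the stores even without the fence. The fence matters when the CPU reads the flag while the kernel is still running, which is the case the reply above has in mind.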

You can also control this on a finer granularity with inline assembly.
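
A rough sketch of what the inline-assembly route could look like, using the PTX `.cv` load and `.wt` store cache operators. It assumes a 64-bit build (hence the `"l"` pointer constraint) and pointers that really refer to global memory. The helper names `load_cv`, `store_wt`, `copy_uncached` and `run_copy_uncached` are invented for this example.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Load with the .cv operator (treat as volatile: fetch again on every load).
// Assumption: p points to global memory.
__device__ __forceinline__ int load_cv(const int *p)
{
    int v;
    asm volatile("ld.global.cv.s32 %0, [%1];" : "=r"(v) : "l"(p) : "memory");
    return v;
}

// Store with the .wt operator (write-through).
// Assumption: p points to global memory.
__device__ __forceinline__ void store_wt(int *p, int v)
{
    asm volatile("st.global.wt.s32 [%0], %1;" : : "l"(p), "r"(v) : "memory");
}

__global__ void copy_uncached(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) store_wt(out + i, load_cv(in + i));
}

// Hypothetical helper: returns the number of mismatched elements, or -1 on a CUDA error.
int run_copy_uncached(int n)
{
    std::vector<int> h_in(n), h_out(n, 0);
    for (int i = 0; i < n; ++i) h_in[i] = 3 * i;

    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    copy_uncached<<<blocks, threads>>>(d_in, d_out, n);

    // The blocking copy also reports any error from the kernel above.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i]) ++mismatches;
    return mismatches;
}
```

The advantage over `volatile` is that the cache operator is chosen per instruction, so the same pointer can be read with a cached load in one place and an uncached one in another.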

Thanks for the reply. I’d like the data to remain in cache between kernel calls rather than using uncached loads.

Sorry, I think I misread your question. I thought you were asking how to make CPU writes visible to the GPU before a kernel finishes.

Hopefully this text from the PTX manual about the default cache policy answers your actual question:

“Cache at all levels, likely to be accessed again.
The default load instruction cache operation is ld.ca, which allocates cache lines in all levels (L1
and L2) with normal eviction policy. Global data is coherent at the L2 level, but multiple L1
caches are not coherent for global data. If one thread stores to global memory via one L1 cache,
and a second thread loads that address via a second L1 cache with ld.ca, the second thread may
get stale L1 cache data, rather than the data stored by the first thread. The driver must
invalidate global L1 cache lines between dependent grids of parallel threads. Stores by the first
grid program are then correctly fetched by the second grid program issuing default loads
cached in L1.”

So only the L1s (not the L2) should be invalidated between dependent kernels. Also note that the
L1s behave as write-through for global data by default (on sm_20, global stores bypass L1 altogether):

“The default store instruction cache operation is st.wb, which writes back cache lines of coherent
cache levels with normal eviction policy. Data stored to local per-thread memory is cached in L1
and L2 with write-back. However, sm_20 does NOT cache global store data in L1 because
multiple L1 caches are not coherent for global data. Global stores bypass L1, and discard any L1
lines that match, regardless of the cache operation. Future GPUs may have globally-coherent L1
caches, in which case st.wb could write-back global store data from L1.”

So the L1s are invalidated, but not written back (the L2 already has the most current value for global data,
and local data is dead after the kernel finishes).
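
As an illustration of the dependent-grid case the manual describes, here is a small sketch with made-up names (`produce`, `consume`, `run_dependent_kernels`). A consumer grid first reads old values with default loads, a producer grid overwrites them, and a second consumer grid in the same stream has to see the new values. Whether the default loads are actually cached in L1 depends on the architecture and compile options, so treat this as a correctness check rather than a cache experiment.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Producer grid: plain global stores.
__global__ void produce(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Consumer grid: default global loads of the same addresses.
__global__ void consume(const int *data, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[i] + 1;
}

// Hypothetical helper: returns the number of stale values observed, or -1 on a CUDA error.
int run_dependent_kernels(int n)
{
    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int *d_data = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_data); return -1; }
    cudaMemset(d_data, 0, bytes);

    int threads = 256, blocks = (n + threads - 1) / threads;
    // All three launches go to the default stream, so each grid depends on the previous one.
    consume<<<blocks, threads>>>(d_data, d_out, n);  // reads the old zeros
    produce<<<blocks, threads>>>(d_data, n);         // overwrites the same addresses
    consume<<<blocks, threads>>>(d_data, d_out, n);  // must observe the new values

    std::vector<int> h_out(n);
    // The blocking copy also reports any error from the kernels above.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int stale = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != i + 1) ++stale;
    return stale;
}
```

If the L1 lines were not invalidated between the grids, the second `consume` could in principle return values computed from the old zeros, and the helper would report a non-zero count.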

This is very helpful. Thanks!

I assume that you were trying to exploit temporal locality in the L1 cache.
What kind of application were you investigating?