Does a kernel write back its output data from cache to global memory when it finishes executing?

A way to think about it is that the L2 cache is a proxy for device memory. Device memory accesses go through the L2 cache. Any access that goes through the L2 cache will read “updated values” as they appear in the cache.

cudaMemcpyHostToDevice → updates L2 cache (see here for an example)
Kernel1 → updates L2 cache
Kernel2 ← reads from L2 cache
cudaMemcpyDeviceToHost ← reads from L2 cache

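To answer your question: not necessarily when the kernel finishes. Modified data in the L2 cache is written out to device memory when the cache needs to make space for new data, according to the cache eviction policy. Until then, anything that reads through the L2 cache still sees the updated values.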
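If it helps, the four-step sequence above can be sketched as a small test program. This is only an illustration: the kernel names `add_one` and `times_two`, the helper `run_pipeline`, and the buffer size are made up for this example. It does not show where the data physically lives at any moment. It only shows that each step observes the values written by the previous one, with no explicit flush between them.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// "Kernel1" (hypothetical name): its writes go through the L2 cache.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// "Kernel2" (hypothetical name): reads the values Kernel1 produced.
__global__ void times_two(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Runs H2D copy -> Kernel1 -> Kernel2 -> D2H copy on the default stream
// and checks that the host sees the final values.
bool run_pipeline() {
    const int n = 1 << 20;                 // arbitrary size for the sketch
    const size_t bytes = n * sizeof(int);
    std::vector<int> host(n, 1);

    int *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    // Host-to-device copy.
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Launches on the same stream execute in order, so times_two
    // starts only after add_one has completed.
    add_one<<<blocks, threads>>>(dev, n);
    times_two<<<blocks, threads>>>(dev, n);

    // Blocking device-to-host copy; it waits for the kernels above.
    cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    // Every element started at 1, so (1 + 1) * 2 == 4 is expected.
    for (int v : host) {
        if (v != 4) return false;
    }
    return true;
}

int main() {
    std::printf("%s\n", run_pipeline() ? "ok" : "mismatch");
    return 0;
}
```

Note there is no cache flush call anywhere in the host code. Ordering comes from the stream, and the reads are correct whether a given line is still in L2 or has already been evicted to device memory.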
