What happens to the GPU cache at the end of the kernel?

I wonder what happens to the content of the GPU L2 cache when the kernel is over, i.e., after cudaDeviceSynchronize()? Is it flushed and then invalidated? For example, if I have two kernels that run consecutively and the output of Kernel1 is passed as the input to Kernel2, then can I say that Kernel2 may already find the input in the cache without needing to read it from the memory? Of course, assuming that the data can fit into the cache and it’s not evicted by other lines.
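Concretely, the pattern I have in mind looks like this (the kernel names `produce`/`consume` are just illustrative):

```cuda
// Kernel1 writes a buffer; Kernel2 reads that same buffer right after.
// The question: after the launches, can consume() still hit Kernel1's
// output in L2, or are the caches invalidated between kernel launches?
__global__ void produce(int *buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = i;
}

__global__ void consume(const int *buf, int *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = buf[i] * 2;
}

// host side:
//   produce<<<grid, block>>>(d_buf, n);
//   consume<<<grid, block>>>(d_buf, d_out, n);  // same data, back-to-back
//   cudaDeviceSynchronize();
```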

All non-coherent caches (e.g. texture cache) are invalidated when a kernel launch is initiated, otherwise stale data in these caches would cause incorrect operation of the kernel. Whether any coherent caches are invalidated as well I do not know. The cache design of GPUs has changed considerably across architecture generations, so there might not be one answer that applies to all of them.

What prompts you to ask the question? If you had precise knowledge of the cache invalidation behavior, how would it help?

Fair question. What prompted me to ask is that the next generation of SoCs will most likely include fully coherent GPUs, and I thought it would be wasteful if the cache contents were invalidated between dependent kernels.

I think we can safely assume that the architects at NVIDIA would not needlessly leave performance on the table. As to whether next-generation GPUs will sport fully-coherent memory subsystems: I do not know, and based on historical precedent, nobody from NVIDIA will comment on future hardware that is not shipping.

When I talked to the Nvidia guys at the GTC conference a few days ago, they said that the next Jetson board will have a fully coherent GPU. I think this is a reasonable assumption, considering that ARM will also start shipping fully coherent SoCs with its new CCI-550 interconnect. But as you say, we'll have to wait and see.

Out of curiosity: Which GTC conference was that? I wasn’t aware that there was one this past week.

GTC Europe in Amsterdam: https://www.gputechconf.eu/Home.aspx

That’s interesting! Could you mention what your experimental platform was?

This is the experiment I had in mind:

$ cat t1819.cu
#include <stdio.h>

__global__ void w(int *data, const int val, const int sz){
  // grid-stride write: fill data[0..sz) with val
  for (int i = threadIdx.x+blockDim.x*blockIdx.x; i < sz; i += gridDim.x*blockDim.x)
    data[i] = val;
}

__global__ void r(int *data, int *r, const int sz){
  // grid-stride read: sum data[0..sz) so the loads are not optimized away
  int val = 0;  // must be initialized; accumulating into an uninitialized local is UB
  for (int i = threadIdx.x+blockDim.x*blockIdx.x; i < sz; i += gridDim.x*blockDim.x)
    val += data[i];
  if (val == 0) *r = val;
}


int main(){

  const int s = 1024*1024;  // 1M
  const int sz = s*sizeof(int);  // 4MB
  int *d1, *d2, *res;
  cudaMalloc(&d1, sz*10);
  cudaMalloc(&d2, sz*10);
  cudaMalloc(&res, sizeof(int));
  cudaMemset(d1, 1, sz);
  cudaMemset(d2, 1, sz);
  w<<<160,1024>>>(d2, 1, s);
  r<<<160,1024>>>(d1, res, s);
  w<<<160,1024>>>(d1, 1, s);
  r<<<160,1024>>>(d1, res, s);
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1819 t1819.cu -lineinfo 
$ CUDA_VISIBLE_DEVICES="0" ncu --cache-control none --metrics lts__t_request_hit_rate.pct ./t1819
==WARNING== Note: Running with uncontrolled GPU caches. Profiling results may be inconsistent.
==PROF== Connected to process 13555 (/home/nvidia/bobc/misc/t1819)
==PROF== Profiling "w(int*, int, int)" - 1: 0%....50%....100% - 1 pass
==PROF== Profiling "r(int*, int*, int)" - 2: 0%....50%....100% - 1 pass
==PROF== Profiling "w(int*, int, int)" - 3: 0%....50%....100% - 1 pass
==PROF== Profiling "r(int*, int*, int)" - 4: 0%....50%....100% - 1 pass
==PROF== Disconnected from process 13555
[13555] t1819@127.0.0.1
  w(int*, int, int), 2020-Sep-30 09:58:47, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    lts__t_request_hit_rate.pct                                                          %                          99.95
    ---------------------------------------------------------------------- --------------- ------------------------------

  r(int*, int*, int), 2020-Sep-30 09:58:47, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    lts__t_request_hit_rate.pct                                                          %                           0.49
    ---------------------------------------------------------------------- --------------- ------------------------------

  w(int*, int, int), 2020-Sep-30 09:58:47, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    lts__t_request_hit_rate.pct                                                          %                          86.62
    ---------------------------------------------------------------------- --------------- ------------------------------

  r(int*, int*, int), 2020-Sep-30 09:58:47, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    lts__t_request_hit_rate.pct                                                          %                          99.96
    ---------------------------------------------------------------------- --------------- ------------------------------
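A note on the flag used above: `--cache-control none` tells Nsight Compute to leave the GPU caches in whatever state the preceding work left them, which is exactly what lets the second `r` launch observe the lines warmed by the preceding `w`. The profiler's default behavior (`--cache-control all`) flushes GPU caches before each replay pass for reproducibility, which would hide the inter-kernel reuse this experiment is measuring. For comparison, the same binary can be rerun with the default:

```shell
# Default cache control: ncu flushes GPU caches before each replay pass,
# so any warmth left by the previous kernel launch is discarded.
CUDA_VISIBLE_DEVICES="0" ncu --cache-control all \
    --metrics lts__t_request_hit_rate.pct ./t1819
```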