What happens to the GPU cache at the end of the kernel?

I wonder what happens to the content of the GPU L2 cache when the kernel is over, i.e., after cudaDeviceSynchronize()? Is it flushed and then invalidated? For example, if I have two kernels that run consecutively and the output of Kernel1 is passed as the input to Kernel2, then can I say that Kernel2 may already find the input in the cache without needing to read it from the memory? Of course, assuming that the data can fit into the cache and it’s not evicted by other lines.
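Concretely, the pattern I have in mind looks like this (the kernel names `produce`/`consume` are just illustrative):

```cuda
// Kernel1 writes a buffer; Kernel2 reads that same buffer right after.
// The question: after the launches, can consume() still hit Kernel1's
// output in L2, or are the caches invalidated between kernel launches?
__global__ void produce(int *buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = i;
}

__global__ void consume(const int *buf, int *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = buf[i] * 2;
}

// host side:
//   produce<<<grid, block>>>(d_buf, n);
//   consume<<<grid, block>>>(d_buf, d_out, n);  // same data, back-to-back
//   cudaDeviceSynchronize();
```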

All non-coherent caches (e.g. texture cache) are invalidated when a kernel launch is initiated, otherwise stale data in these caches would cause incorrect operation of the kernel. Whether any coherent caches are invalidated as well I do not know. The cache design of GPUs has changed considerably across architecture generations, so there might not be one answer that applies to all of them.

What prompts you to ask the question? If you had precise knowledge of the cache invalidation behavior, how would it help?

Fair question. What prompted me to ask is that the next generation of SoCs will most likely include fully coherent GPUs, and I thought it would be wasteful if the cache contents were invalidated between dependent kernels.

I think we can safely assume that the architects at NVIDIA would not needlessly leave performance on the table. As to whether next-generation GPUs will sport fully-coherent memory subsystems: I do not know, and based on historical precedent, nobody from NVIDIA will comment on future hardware that is not shipping.

When I talked to the Nvidia guys at the GTC conference a few days ago, they said that the next Jetson board will have a fully coherent GPU. I think this is a reasonable assumption, considering that ARM will also start shipping fully coherent SoCs with its new CCI-550 interconnect. But as you say, we'll have to wait and see.

Out of curiosity: Which GTC conference was that? I wasn’t aware that there was one this past week.

GTC Europe in Amsterdam: https://www.gputechconf.eu/Home.aspx

That’s interesting! Could you mention what your experimental platform was?

This is the experiment I had in mind:

$ cat t1819.cu
#include <stdio.h>

__global__ void w(int *data, const int val, const int sz){
  // grid-stride write: fill data[0..sz) with val
  for (int i = threadIdx.x+blockDim.x*blockIdx.x; i < sz; i += gridDim.x*blockDim.x)
    data[i] = val;
}

__global__ void r(int *data, int *r, const int sz){
  // grid-stride read: sum data[0..sz) so the loads are not optimized away
  int val = 0;  // must be initialized; accumulating into an uninitialized local is UB
  for (int i = threadIdx.x+blockDim.x*blockIdx.x; i < sz; i += gridDim.x*blockDim.x)
    val += data[i];
  if (val == 0) *r = val;
}


int main(){

  const int s = 1024*1024;  // 1M
  const int sz = s*sizeof(int);  // 4MB
  int *d1, *d2, *res;
  cudaMalloc(&d1, sz*10);
  cudaMalloc(&d2, sz*10);
  cudaMalloc(&res, sizeof(int));
  cudaMemset(d1, 1, sz);
  cudaMemset(d2, 1, sz);
  w<<<160,1024>>>(d2, 1, s);
  r<<<160,1024>>>(d1, res, s);
  w<<<160,1024>>>(d1, 1, s);
  r<<<160,1024>>>(d1, res, s);
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1819 t1819.cu -lineinfo 
$ CUDA_VISIBLE_DEVICES="0" ncu --cache-control none --metrics lts__t_request_hit_rate.pct ./t1819
==WARNING== Note: Running with uncontrolled GPU caches. Profiling results may be inconsistent.
==PROF== Connected to process 13555 (/home/nvidia/bobc/misc/t1819)
==PROF== Profiling "w(int*, int, int)" - 1: 0%....50%....100% - 1 pass
==PROF== Profiling "r(int*, int*, int)" - 2: 0%....50%....100% - 1 pass
==PROF== Profiling "w(int*, int, int)" - 3: 0%....50%....100% - 1 pass
==PROF== Profiling "r(int*, int*, int)" - 4: 0%....50%....100% - 1 pass
==PROF== Disconnected from process 13555
[13555] t1819@127.0.0.1
  w(int*, int, int), 2020-Sep-30 09:58:47, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    lts__t_request_hit_rate.pct                                                          %                          99.95
    ---------------------------------------------------------------------- --------------- ------------------------------

  r(int*, int*, int), 2020-Sep-30 09:58:47, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    lts__t_request_hit_rate.pct                                                          %                           0.49
    ---------------------------------------------------------------------- --------------- ------------------------------

  w(int*, int, int), 2020-Sep-30 09:58:47, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    lts__t_request_hit_rate.pct                                                          %                          86.62
    ---------------------------------------------------------------------- --------------- ------------------------------

  r(int*, int*, int), 2020-Sep-30 09:58:47, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    lts__t_request_hit_rate.pct                                                          %                          99.96
    ---------------------------------------------------------------------- --------------- ------------------------------
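A note on the flag used above: `--cache-control none` tells Nsight Compute to leave the GPU caches in whatever state the preceding work left them, which is exactly what lets the second `r` launch observe the lines warmed by the preceding `w`. The profiler's default behavior (`--cache-control all`) flushes GPU caches before each replay pass for reproducibility, which would hide the inter-kernel reuse this experiment is measuring. For comparison, the same binary can be rerun with the default:

```shell
# Default cache control: ncu flushes GPU caches before each replay pass,
# so any warmth left by the previous kernel launch is discarded.
CUDA_VISIBLE_DEVICES="0" ncu --cache-control all \
    --metrics lts__t_request_hit_rate.pct ./t1819
```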