I think it should be possible to test. Run Nsight Compute, make sure to override its cache control so that it does not flush the GPU caches before profiling the kernel (--cache-control none), then look at an L2 hit-rate metric.
If you are looking for documentation, I don’t think NVIDIA documents the specified behavior anywhere.
The only situation I know of where L2 gets “bypassed” is for transfers from system memory (i.e. pinned host memory) initiated by SM (device code) activity. And I imagine that could change in the future.
Here is an example of a test:
# cat t368.cu
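#include <iostream>

// Grid-stride copy kernel: each thread copies elements i, i+stride, i+2*stride, ...
template <typename T>
__global__ void copy_k(T *in, T *out, size_t n){
  for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i += gridDim.x*blockDim.x)
    out[i] = in[i];
}

using mt = int;

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  std::cout << "Device 0: " << prop.name << std::endl;
  int l2cs = prop.l2CacheSize;
  std::cout << "L2 Cache Size: " << l2cs << std::endl;
  int nsm = prop.multiProcessorCount;
  std::cout << "Multiprocesors: " << nsm << std::endl;
  // Test buffer is one quarter of L2, rounded down to a whole number of elements
  size_t s = l2cs/4;
  size_t sz = s/sizeof(mt);   // element count
  s = sz*sizeof(mt);          // size in bytes
  std::cout << "Test Buffer Size: " << s << std::endl;
  mt *in1, *out1, *out2, *r;
  cudaMalloc(&in1, s);
  cudaMalloc(&out1, s);
  cudaMalloc(&out2, s);
  r = new mt[sz];
  cudaMemset(in1, 1, s);
  // The D2D copy under test: does it leave out1 resident in L2?
  cudaMemcpy(out1, in1, s, cudaMemcpyDeviceToDevice);
  // The kernel reads out1; its L2 read hit rate is the metric collected by ncu
  copy_k<<<nsm,512>>>(out1, out2, sz);
  // Blocking copy back to the host, which also waits for the kernel to finish
  cudaMemcpy(r, out2, s, cudaMemcpyDeviceToHost);
  // Release host and device buffers
  delete[] r;
  cudaFree(in1);
  cudaFree(out1);
  cudaFree(out2);
}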
# nvcc -arch=sm_89 -o t368 t368.cu
# ncu --cache-control none --metrics lts__t_sector_op_read_hit_rate.pct ./t368
==WARNING== Note: Running with uncontrolled GPU caches. Profiling results may be inconsistent.
==PROF== Connected to process 42154 (/root/bobc/t368)
Device 0: NVIDIA L4
L2 Cache Size: 50331648
Multiprocesors: 58
Test Buffer Size: 12582912
==PROF== Profiling "copy_k" - 0: 0%....50%....100% - 1 pass
==PROF== Disconnected from process 42154
[42154] t368@127.0.0.1
  void copy_k<int>(T1 *, T1 *, unsigned long) (58, 1, 1)x(512, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
    Section: Command line profiler metrics
    ---------------------------------- ----------- ------------
    Metric Name                        Metric Unit Metric Value
    ---------------------------------- ----------- ------------
    lts__t_sector_op_read_hit_rate.pct           %        99.98
    ---------------------------------- ----------- ------------
#
This was run with CUDA 12.2.
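Based on that, it seems to me like the D2D operation is populating the L2 cache: the kernel's reads of out1 hit in L2 99.98% of the time, and the only thing that touched out1 before the kernel launch was the cudaMemcpy.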
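A note on the buffer sizing, since the result depends on it. The test uses a buffer of one quarter of L2, and my reading (an assumption, not something NVIDIA documents) is that the point is simply to keep everything the copy and the kernel touch small enough to stay resident. Here is a rough host-only sketch of that arithmetic, using the L2 size reported for the L4 above. The helper name test_buffer_bytes is made up for illustration and is not part of t368.cu:

```cpp
#include <cstddef>

// Hypothetical helper (not in t368.cu): the same sizing arithmetic as the test.
// Takes one quarter of L2 and rounds down to a whole number of elements of T.
template <typename T>
constexpr std::size_t test_buffer_bytes(std::size_t l2_bytes) {
    std::size_t s  = l2_bytes / 4;      // one quarter of L2, in bytes
    std::size_t sz = s / sizeof(T);     // whole elements that fit
    return sz * sizeof(T);              // back to bytes
}

// L2 size printed by the test on the L4 (prop.l2CacheSize).
constexpr std::size_t l4_l2_bytes = 50331648;
constexpr std::size_t l4_buf_bytes = test_buffer_bytes<int>(l4_l2_bytes);

// Matches the "Test Buffer Size" line in the output above.
static_assert(l4_buf_bytes == 12582912, "buffer size should match the log");

// in1, out1 and out2 together are three quarters of L2, so all three
// buffers could in principle be resident at once.
static_assert(3 * l4_buf_bytes <= l4_l2_bytes, "three buffers fit in L2");
```

If the buffer were made larger than L2, I would expect the hit rate to fall off regardless of what the D2D copy does, so a low number in that case would not tell you much.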