Cache Operators broken in CUDA Toolkit 12.4

Hello,

I have an application in which I’m benchmarking the bandwidth and performance of certain models. To do so, I want to force all data traffic to L2 by disabling the L1 cache.
When I compile my code with CUDA 12.0 and the cache operators -Xptxas -dlcm=cv -Xptxas -dscm=wt, the behaviour is as intended. This can be seen in the first figure below, where all of the traffic goes through L1 and L2.
[image]
And the nvcc version:
[image]

But when using 12.4 the cache operators seem to have no effect, with the following result:
[image]
Every configuration of the cache operators seems to give the same results, so the operators have no effect on the actual program. The version of nvcc used for the second scenario:
[image]

I would like to replicate the behaviour of 12.0 in 12.4, as I intend to use the 2024.1 version of ncu. If someone could give me a hand, I’d be very grateful!

It is possible for the L1 hardware to be put to use in service of the read-only cache. This type of usage shows up in the profiler as L1 activity, but it has a different kind of SASS instruction associated with it, one that is not subject to the cache operators. Whether or not that is happening in your case cannot be determined without seeing the code. Since the choice of path/SASS instruction is an implementation detail, it’s possible that compiler behavior changed in this respect between CUDA 12.0 and CUDA 12.4.
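
To make the distinction concrete, here is a minimal sketch and not a reconstruction of the original program. The kernel names scale_readonly and scale_explicit and the helper run_both are hypothetical. The first kernel marks its input const __restrict__, which allows (but does not oblige) the compiler to use the read-only path (ld.global.nc in PTX). The second requests the cache operator per access with the documented intrinsics __ldcv and __stwt, which correspond to the cv and wt operators, instead of relying on the -dlcm/-dscm defaults. Whether this restores the 12.0 profile in your case is an assumption that would need to be checked in the profiler.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Hypothetical kernel A: const + __restrict__ tells the compiler the input is
// read-only and not aliased, so it MAY emit a non-coherent (read-only path) load.
__global__ void scale_readonly(const float* __restrict__ in,
                               float* __restrict__ out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * in[i];
}

// Hypothetical kernel B: the cache operator is requested explicitly per access.
// __ldcv asks for a "cv" load and __stwt for a "wt" store.
__global__ void scale_explicit(const float* in, float* out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) __stwt(&out[i], k * __ldcv(&in[i]));
}

// Runs both kernels on n elements; returns true if both outputs equal k * in[i].
bool run_both(int n, float k)
{
    assert(n > 0);
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess &&
              cudaMalloc(&d_a, bytes) == cudaSuccess &&
              cudaMalloc(&d_b, bytes) == cudaSuccess &&
              cudaMemcpy(d_in, h_in.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        scale_readonly<<<blocks, threads>>>(d_in, d_a, k, n);
        scale_explicit<<<blocks, threads>>>(d_in, d_b, k, n);
        // The blocking copies below also wait for both kernels to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(h_a.data(), d_a, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(h_b.data(), d_b, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int i = 0; ok && i < n; ++i)
        ok = (h_a[i] == k * h_in[i]) && (h_b[i] == k * h_in[i]);

    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
    return ok;
}
```

Calling run_both(1 << 16, 2.0f) from a host main and dumping the binary with cuobjdump -sass under each toolkit would show which load instruction the compiler picked for each kernel. If the 12.4 build of the real code shows read-only-path loads where the 12.0 build did not, that would be consistent with the explanation above.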