Hello,
I have an application where I'm benchmarking the bandwidth and performance of certain models, and to do so I'm aiming to force all data to L2 by disabling the L1 cache.
When I compile my code with 12.0, the behaviour is as intended when using the cache operators -Xptxas -dlcm=cv -Xptxas -dscm=wt. This can be seen in the first figure below, where all of the traffic goes through L1 and L2.
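For reference, a stripped-down version of the kind of benchmark I'm compiling (the kernel and sizes here are simplified placeholders, but the flags are exactly the ones above):

```cuda
// bench.cu -- minimal copy kernel for observing memory traffic in ncu.
// Compiled with the cache operators that should bypass L1:
//   nvcc -Xptxas -dlcm=cv -Xptxas -dscm=wt -o bench bench.cu
#include <cstdio>

__global__ void copyKernel(const float* __restrict__ in,
                           float* __restrict__ out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // plain load/store, so -dlcm/-dscm choose the cache ops
}

int main()
{
    const size_t n = 1 << 24;  // 16M floats, placeholder size
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    copyKernel<<<(unsigned)((n + 255) / 256), 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```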
And the nvcc version:
But when using 12.4, the cache operators seem to have no effect, with the following result:
Every configuration of the cache operators seems to produce the same result, i.e. they have no effect on the actual program. The version of nvcc used for the second scenario:
I would like to replicate the behaviour of 12.0 in 12.4, as I intend to use the 2024.1 version of ncu. If someone could give me a hand, I'd be very grateful!
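For completeness: my understanding is that the per-instruction cache-hint intrinsics (__ldcv / __stwt) emit ld.global.cv / st.global.wt directly and shouldn't depend on the -dlcm/-dscm flags, so a fallback could look like the sketch below. I'd still prefer the flag-based approach, though, so the rest of the code stays untouched.

```cuda
// Fallback sketch (kernel name is a placeholder): per-instruction cache
// hints instead of the module-wide -dlcm/-dscm flags. __ldcv emits
// ld.global.cv and __stwt emits st.global.wt, so L1 should be bypassed
// regardless of how ptxas treats the flags.
__global__ void copyKernelHinted(const float* __restrict__ in,
                                 float* __restrict__ out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        __stwt(&out[i], __ldcv(&in[i]));
}
```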