When using peer memory, how can I measure L1 or L2 cache behavior on the GPU that performs the access?

I posted a question in the Nsight Compute category and received an answer about how to measure NVLink traffic.
However, my main question is about whether data read from peer memory is cached.
I think that question fits this category better, so I am asking it again here.

Q. Is there a way to see whether the data is being cached?
(I also want to check the hit ratio for the data that was accessed from the peer GPU.)

Generally it should not be difficult to see whether the data is being cached locally or not. Devise a test that repeatedly accesses the data (perhaps by copying it). If each repeated access runs at the same slow rate, the data is not being cached; that slower rate will be consistent with whatever the peer bus speed is, PCIe or NVLink. If the measured performance is consistent with a higher rate, then the data would appear to be cached locally.
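As one rough illustration of such a test, here is a minimal sketch (not verified code, and only an assumption of how it could be set up): it assumes two peer-capable GPUs with device 0 reading a buffer resident on device 1, omits error checking, and uses a buffer small enough to fit in the local L2 so that a repeated pass could hit in cache if peer reads were being cached.

```cpp
// Sketch: time repeated reads of a buffer that lives on a peer GPU.
// If every pass runs at roughly the peer-link rate, the data is not
// being cached locally; a much faster later pass would suggest caching.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void readAll(const float *src, float *out, size_t n)
{
    float acc = 0.0f;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        acc += src[i];            // stride over the whole buffer
    atomicAdd(out, acc);          // keep the loads from being optimized away
}

int main()
{
    const size_t n = 1 << 20;     // 4 MB of floats: small enough for most L2 caches
    const int local = 0, remote = 1;

    cudaSetDevice(local);
    cudaDeviceEnablePeerAccess(remote, 0);   // map device 1's memory into device 0

    float *src = nullptr, *out = nullptr;
    cudaSetDevice(remote);
    cudaMalloc(&src, n * sizeof(float));     // source buffer lives on the remote GPU
    cudaMemset(src, 0, n * sizeof(float));
    cudaSetDevice(local);
    cudaMalloc(&out, sizeof(float));
    cudaMemset(out, 0, sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int pass = 0; pass < 4; ++pass) {   // read the same remote data repeatedly
        cudaEventRecord(start);
        readAll<<<256, 256>>>(src, out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("pass %d: %.3f ms  (~%.1f GB/s)\n",
               pass, ms, n * sizeof(float) / ms / 1.0e6);
    }
    return 0;
}
```

Compare the per-pass rate against the local GPU's memory bandwidth on one hand and the PCIe or NVLink bandwidth on the other, and see which one the numbers match.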

Regarding the profiler: it has metrics for cache hit rates. It should be possible to devise a test similar to the one described above, run it under the profiler, and observe the hit-rate metrics.
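For example, a test like the sketch above could be run under Nsight Compute with the L1 and L2 hit-rate metrics. One possible invocation (the binary name is just a placeholder, and metric names can differ between architectures, so confirm them with `ncu --query-metrics`) would be `ncu --metrics l1tex__t_sector_hit_rate.pct,lts__t_sector_hit_rate.pct ./peer_cache_test`, which reports the L1 and L2 sector hit rates for the kernel that reads the peer buffer.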


I tried the approach you described.
There appears to be no caching at all.
(The total amount of data in the remote GPU's memory is 256 MB, and the Received User Bytes value is about 268 MB.)

Peer memory is enabled.
In this state, when the local GPU reads data from the remote GPU's memory, I expected the data to be cached by default in the local GPU's L2 cache.
However, that is not what the experiment appears to show.
The data seems to be read directly from the peer each time, as a one-time read, without being cached.