When using peer memory, how can I measure L1 or L2 cache values on the operating GPU?

cudaMancpy · February 25, 2025, 4:17am

I ran a program that reads data from GPU1 and processes it on GPU0 using peer memory.
In nsys (Nsight Systems), I confirmed that data is moving from GPU1 to GPU0 over NVLink.
However, in ncu-ui (the Nsight Compute GUI), I could not observe any data movement from peer memory.

It seems that ncu measures peer memory data movement through the L2 cache miss counter.
I therefore also checked the L1 cache miss counter.
Both the L2 cache value and the L1 cache value were measured as 0 on GPU0.
Since the results of the computation are correct and the data movement is visible in nsys, I believe the data really is being transferred from GPU1’s memory (peer memory).

I have two questions.

How can I measure data movement from/to peer memory “through the ncu”?
How can I enable caching data from peer memory?

I also collected the following metrics with the ncu CLI.
All of the metric values were 0.
lts__t_requests_srcunit_l1_aperture_peer
lts__t_requests_srcunit_l1_aperture_peer_evict_first
lts__t_requests_srcunit_l1_aperture_peer_evict_first_lookup_hit
lts__t_requests_srcunit_l1_aperture_peer_evict_first_lookup_miss
lts__t_requests_srcunit_l1_aperture_peer_evict_last
lts__t_requests_srcunit_l1_aperture_peer_evict_last_lookup_hit
lts__t_requests_srcunit_l1_aperture_peer_evict_last_lookup_miss
lts__t_requests_srcunit_l1_aperture_peer_evict_normal
lts__t_requests_srcunit_l1_aperture_peer_evict_normal_demote
lts__t_requests_srcunit_l1_aperture_peer_evict_normal_demote_lookup_hit
lts__t_requests_srcunit_l1_aperture_peer_evict_normal_demote_lookup_miss
lts__t_requests_srcunit_l1_aperture_peer_evict_normal_lookup_hit
lts__t_requests_srcunit_l1_aperture_peer_evict_normal_lookup_miss
lts__t_requests_srcunit_l1_aperture_peer_lookup_hit
lts__t_requests_srcunit_l1_aperture_peer_lookup_miss
lts__t_requests_srcunit_l1_aperture_peer_op_atom
lts__t_requests_srcunit_l1_aperture_peer_op_atom_dot_alu
lts__t_requests_srcunit_l1_aperture_peer_op_atom_dot_alu_lookup_hit
lts__t_requests_srcunit_l1_aperture_peer_op_atom_dot_cas
lts__t_requests_srcunit_l1_aperture_peer_op_atom_dot_cas_lookup_hit
lts__t_requests_srcunit_l1_aperture_peer_op_atom_lookup_hit
lts__t_requests_srcunit_l1_aperture_peer_op_atom_lookup_miss
lts__t_requests_srcunit_l1_aperture_peer_op_membar
lts__t_requests_srcunit_l1_aperture_peer_op_membar_lookup_hit
lts__t_requests_srcunit_l1_aperture_peer_op_membar_lookup_miss
lts__t_requests_srcunit_l1_aperture_peer_op_read
lts__t_requests_srcunit_l1_aperture_peer_op_read_lookup_hit
lts__t_requests_srcunit_l1_aperture_peer_op_read_lookup_miss
lts__t_requests_srcunit_l1_aperture_peer_op_red
lts__t_requests_srcunit_l1_aperture_peer_op_red_lookup_hit
lts__t_requests_srcunit_l1_aperture_peer_op_red_lookup_miss
lts__t_requests_srcunit_l1_aperture_peer_op_write
lts__t_requests_srcunit_l1_aperture_peer_op_write_lookup_hit
lts__t_requests_srcunit_l1_aperture_peer_op_write_lookup_miss
lts__t_sectors_srcunit_l1_aperture_peer
lts__t_sectors_srcunit_l1_aperture_peer_evict_first
lts__t_sectors_srcunit_l1_aperture_peer_evict_first_lookup_hit
lts__t_sectors_srcunit_l1_aperture_peer_evict_first_lookup_miss
lts__t_sectors_srcunit_l1_aperture_peer_evict_last
lts__t_sectors_srcunit_l1_aperture_peer_evict_last_lookup_hit
lts__t_sectors_srcunit_l1_aperture_peer_evict_last_lookup_miss
lts__t_sectors_srcunit_l1_aperture_peer_evict_normal
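
For reference, here is a minimal Python sketch of how counters like the ones above could be turned into a byte count and a hit rate once they report non-zero values. It assumes the values have already been exported from ncu into a plain dictionary; the `peer_summary` helper, the `.sum` rollup suffix on the keys, and the sample numbers are illustrative assumptions, not output from a real run. The only hardware fact used is that an L2 sector is 32 bytes.

```python
SECTOR_BYTES = 32  # one L2 sector is a 32-byte chunk

def peer_summary(counters):
    """Summarise peer-aperture L2 counters (hypothetical helper).

    counters maps a metric name (with an assumed ".sum" rollup) to its value.
    Missing metrics are treated as 0.
    """
    prefix = "lts__t_requests_srcunit_l1_aperture_peer"
    hits = counters.get(prefix + "_lookup_hit.sum", 0)
    misses = counters.get(prefix + "_lookup_miss.sum", 0)
    sectors = counters.get("lts__t_sectors_srcunit_l1_aperture_peer.sum", 0)
    lookups = hits + misses
    return {
        "bytes": sectors * SECTOR_BYTES,
        # A hit rate is undefined when no lookups were counted at all.
        "hit_rate": hits / lookups if lookups else None,
    }

# The situation described in this post: every counter is 0.
all_zero = peer_summary({})

# Made-up values, only to show the arithmetic.
example = peer_summary({
    "lts__t_requests_srcunit_l1_aperture_peer_lookup_hit.sum": 300,
    "lts__t_requests_srcunit_l1_aperture_peer_lookup_miss.sum": 100,
    "lts__t_sectors_srcunit_l1_aperture_peer.sum": 1600,
})

print(all_zero)
print(example)
```

With all counters at 0 the hit rate comes back as `None` rather than 0, which keeps "nothing was counted" distinct from "everything missed".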

felix_dt · February 27, 2025, 12:41pm

It seems you have already found the reason why no peer traffic is shown for the NVLink data here. The peer traffic metrics are for PCIe-connected GPUs; they do not count NVLink traffic. NVLink traffic is shown in the NVLink section.

To collect the NVLink section, use --set nvlink or --set full --section Nvlink.
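
As a rough sketch, the two option combinations mentioned above can be assembled into full ncu command lines like this. The options are copied verbatim from this reply; the helper name `ncu_nvlink_commands` and the application path `./p2p_app` are placeholders I made up for illustration.

```python
import shlex

def ncu_nvlink_commands(app_argv):
    """Return the two ncu invocations named in the reply, as argv lists."""
    return [
        ["ncu", "--set", "nvlink", *app_argv],
        ["ncu", "--set", "full", "--section", "Nvlink", *app_argv],
    ]

# "./p2p_app" is a placeholder for the program being profiled.
for argv in ncu_nvlink_commands(["./p2p_app"]):
    print(shlex.join(argv))
```

Keeping each command as an argv list means it can be passed straight to `subprocess.run` without going through a shell.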

cudaMancpy · March 4, 2025, 2:24am

Thank you for explaining how to measure it.
I would like to ask again about the second question in my original post.
With the NVLink section I can now measure and calculate the data movement, but I still wonder whether peer memory is cached by default.
If it is not, how can I check that, and how can I enable caching?
