Problems with lts__t_requests_srcunit_tex_aperture_peer


I’m testing P2P memory access and want to know how many sectors are transferred when an access misses in the local GPU’s data cache. I ran a kernel that reads peer GPU memory and collected the lts__t_requests_srcunit_tex_aperture_peer metric, but it reported zero.
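For reference, here is a minimal sketch of this kind of test, not the exact code. It assumes devices 0 and 1 are peer-capable; the device IDs, buffer size, and the kernel name read_peer are placeholders for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Abort main() with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            return 1;                                                \
        }                                                            \
    } while (0)

// Each thread loads one float from the peer GPU's buffer and stores it
// into a buffer on the local GPU.
__global__ void read_peer(const float* peer_src, float* local_dst, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) local_dst[i] = peer_src[i];
}

int main() {
    const int local_dev = 0;      // assumed: GPU that runs the kernel
    const int peer_dev = 1;       // assumed: GPU that owns the source data
    const size_t n = 1 << 20;     // assumed: 1M floats (4 MiB)

    // Check that the local GPU can map the peer GPU's memory.
    int can_access = 0;
    CHECK(cudaDeviceCanAccessPeer(&can_access, local_dev, peer_dev));
    if (!can_access) {
        std::fprintf(stderr, "P2P access is not supported between these GPUs\n");
        return 1;
    }

    // Allocate and initialize the source buffer on the peer GPU.
    float* peer_buf = nullptr;
    CHECK(cudaSetDevice(peer_dev));
    CHECK(cudaMalloc(&peer_buf, n * sizeof(float)));
    CHECK(cudaMemset(peer_buf, 0, n * sizeof(float)));
    CHECK(cudaDeviceSynchronize());

    // Enable peer access from the local GPU and allocate the destination.
    float* local_buf = nullptr;
    CHECK(cudaSetDevice(local_dev));
    CHECK(cudaDeviceEnablePeerAccess(peer_dev, 0));
    CHECK(cudaMalloc(&local_buf, n * sizeof(float)));

    // Launch on the local GPU; every load targets peer memory.
    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);
    read_peer<<<blocks, threads>>>(peer_buf, local_buf, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaFree(local_buf));
    CHECK(cudaSetDevice(peer_dev));
    CHECK(cudaFree(peer_buf));
    return 0;
}
```

The metric is then collected by profiling the resulting binary with ncu and its --metrics option.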

Did I misunderstand something? How do I measure peer memory access correctly with ncu?


The metric you’re looking at should report a non-zero value if there are peer reads. A request can comprise multiple sector reads, so in our memory calculations we use the metric “lts__t_sectors_srcunit_tex_aperture_peer_lookup_miss.sum” for peer accesses. You could try collecting that. Please also share a screenshot of the L2 Cache table from the Memory Workload Analysis section, like the one below. That will help us verify whether peer accesses are actually occurring.

L2 Cache table:

I tried the lts__t_sectors_srcunit_tex_aperture_peer_lookup_miss.sum metric and also got zero.

Below are the NVLink received and transmitted bytes, which indicate that data transfer actually happened.

Moreover, if I collect l1tex__m_xbar2l1tex_read_sectors, I get a value that matches the NVLink user received data.

I didn’t realize you were using NVLink, as opposed to PCIe. The peer metrics are for PCIe-connected GPUs. The NVLink section is what will show the communication between GPUs in your system. The only thing to be aware of is that the NVLink metrics are device-wide, so make sure that no other applications are generating traffic at the same time.