Measuring read and write bandwidth (RdBW and WrBW) per GPU

Hi all,

Our application has many CUDA kernels supporting multiple inference nets. They run on multiple GPUs (2080s) installed in a regular Skylake PC. Over a given portion of our run, we would like to measure the total data read from and written to DDR for each GPU, and also the peer-to-peer (P2P) traffic over PCIe (we are not using NVLink).

Which tools should we use? nvvs? Are there performance counters that we can read and report directly, so that we can estimate the read/write contribution of each internal module? If these counters are hidden and not exposed, which APIs or libraries would give us the information we want?

Thanks.