PCIe bandwidth information

I have a CUDA program that uses UVM to perform its tasks. I have changed the code to support zero-copy access. For zero-copy access to be efficient, the PCIe bandwidth must be utilised as fully as possible.
Therefore I want to measure the maximum bandwidth (amount of data transferred per unit time) achieved by my program. I have not been able to find a way to do this in ncu or nsys. Can someone point me to some resources or a way to do this?

You can take the total PCIe bytes transferred by your kernel and divide by the kernel duration. If you're wondering how to get this data from a profiler, I suggest asking on one of the profiler forums.
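
As a minimal sketch of the division described above, assuming you already have a byte count and a kernel duration from a profiler: the function name `achieved_bandwidth_gb_per_s` and the numbers in the example are made up for illustration, and 1 GB is taken as 1e9 bytes.

```python
def achieved_bandwidth_gb_per_s(bytes_transferred: int, duration_ns: float) -> float:
    """Average bandwidth in GB/s (1 GB = 1e9 bytes) over one kernel.

    Hypothetical helper: both inputs are assumed to come from a profiler.
    """
    if duration_ns <= 0:
        raise ValueError("duration_ns must be positive")
    # Bytes per nanosecond is numerically equal to GB/s (1e9 B / 1e9 ns).
    return bytes_transferred / duration_ns


# Illustrative values only: 256 MiB moved over PCIe by a kernel that ran 12.5 ms.
example = achieved_bandwidth_gb_per_s(256 * 1024**2, 12.5e6)
print(f"{example:.2f} GB/s")
```

This is an average over the kernel's runtime, so it can sit well below the peak rate if the kernel only touches host memory for part of its execution.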

Nsight Compute can measure the PCIe bytes transferred during execution of the kernel. The metrics are:

pcie__read_bytes.sum
pcie__write_bytes.sum

To get the achieved bandwidth you can use

pcie__read_bytes.sum.per_second
pcie__write_bytes.sum.per_second

Nsight Systems supports collecting GPU metrics at a high sampling rate and graphing them alongside the CUDA trace. The default metrics set includes the two PCIe byte metrics listed above. This will allow you to view the data over time.


Thank you for the information. Also, my program is a kind of graph algorithm, so it calls the same supporting kernel many times to perform an operation. The output for the metrics above is reported separately for each kernel launch. Is there any way I can get collective data for all kernels (for example, the maximum value of pcie__read_bytes.sum across all kernels)?

Nsight Compute can collect metrics over a user-defined range instead of per kernel. This will give you one measurement over the full workload. It will not give you an instantaneous maximum during that period.

I would suggest using Nsight Systems to look at how PCIe traffic changes over the duration of your workload.
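
If you stay with per-kernel output, the collective numbers can also be computed afterwards from the per-launch values. The following is only a sketch under stated assumptions: `KernelSample`, `summarize`, and the sample values are hypothetical, and you would have to fill the records from your own profiler export.

```python
from typing import NamedTuple


class KernelSample(NamedTuple):
    """One kernel launch (hypothetical record filled from profiler output)."""
    read_bytes: int      # e.g. pcie__read_bytes.sum for this launch
    write_bytes: int     # e.g. pcie__write_bytes.sum for this launch
    duration_ns: float   # kernel duration in nanoseconds


def summarize(samples):
    """Collective PCIe statistics over many launches of the same kernel."""
    if not samples:
        raise ValueError("need at least one sample")
    # Bytes per nanosecond equals GB/s when 1 GB = 1e9 bytes.
    per_kernel = [(s.read_bytes + s.write_bytes) / s.duration_ns for s in samples]
    total_bytes = sum(s.read_bytes + s.write_bytes for s in samples)
    total_ns = sum(s.duration_ns for s in samples)
    return {
        "max_read_bytes": max(s.read_bytes for s in samples),
        "peak_kernel_gb_per_s": max(per_kernel),
        "aggregate_gb_per_s": total_bytes / total_ns,
    }


# Illustrative values only, not measurements.
samples = [
    KernelSample(4_000_000, 1_000_000, 500_000.0),
    KernelSample(12_000_000, 0, 800_000.0),
    KernelSample(2_000_000, 2_000_000, 400_000.0),
]
summary = summarize(samples)
print(summary)
```

Note that the aggregate figure here divides by the sum of kernel durations only, so it ignores idle gaps between launches, and the peak is still a per-kernel average rather than an instantaneous maximum.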

