PCIe RX throughput rises strangely when running multiple streams in a CUDA graph

When I launch a manually constructed CUDA graph composed of two branches (streams), PCIe RX throughput rises much more quickly than with a graph that contains only a single stream. I was confused …
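For anyone trying to reproduce this, here is a minimal sketch of what a two-branch graph built through stream capture might look like. It is not the actual graph from the report: the kernel `scale`, the buffer size, and the iteration count are placeholders chosen only for illustration, and it assumes CUDA 11.4 or newer for `cudaGraphInstantiateWithFlags`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Abort-on-error helper for the sketch.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Placeholder kernel (hypothetical): multiplies each element by a constant.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;  // arbitrary size for illustration
    float *a = nullptr, *b = nullptr;
    CHECK(cudaMalloc(&a, n * sizeof(float)));
    CHECK(cudaMalloc(&b, n * sizeof(float)));
    CHECK(cudaMemset(a, 0, n * sizeof(float)));
    CHECK(cudaMemset(b, 0, n * sizeof(float)));

    cudaStream_t s1, s2;
    CHECK(cudaStreamCreate(&s1));
    CHECK(cudaStreamCreate(&s2));
    cudaEvent_t fork_ev, join_ev;
    CHECK(cudaEventCreateWithFlags(&fork_ev, cudaEventDisableTiming));
    CHECK(cudaEventCreateWithFlags(&join_ev, cudaEventDisableTiming));

    // Capture work from two streams into a single graph with two branches.
    cudaGraph_t graph;
    CHECK(cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal));
    CHECK(cudaEventRecord(fork_ev, s1));
    CHECK(cudaStreamWaitEvent(s2, fork_ev, 0));   // s2 joins the capture (fork)
    scale<<<(n + 255) / 256, 256, 0, s1>>>(a, n, 2.0f);  // branch 1
    scale<<<(n + 255) / 256, 256, 0, s2>>>(b, n, 3.0f);  // branch 2
    CHECK(cudaEventRecord(join_ev, s2));
    CHECK(cudaStreamWaitEvent(s1, join_ev, 0));   // branch 2 rejoins s1 (join)
    CHECK(cudaStreamEndCapture(s1, &graph));

    cudaGraphExec_t exec;
    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    // Launch the graph repeatedly so the profiler has something to sample.
    for (int iter = 0; iter < 100; ++iter) {
        CHECK(cudaGraphLaunch(exec, s1));
    }
    CHECK(cudaStreamSynchronize(s1));
    printf("done\n");

    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaEventDestroy(fork_ev));
    CHECK(cudaEventDestroy(join_ev));
    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaStreamDestroy(s2));
    CHECK(cudaFree(a));
    CHECK(cudaFree(b));
    return 0;
}
```

Removing the `s2` branch (the fork event, the second kernel launch, and the join event) gives the single-stream variant, so the two can be profiled side by side.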

By the way, what does the metric PCIe Read Requests to BAF mean, and how can I verify which system factors affect PCIe bandwidth?

@liuyis, can you please take a look at the CUDA graph question in this issue?

Meanwhile, which version of Nsys are you using?

@user122022 Could you share the report file for us to take a closer look?

Regarding PCIe Read Requests to BAF: the name is actually PCIe Read Requests to BAR1, and if you hover over the name, you can see the description: CPU+Peer Reads from VRAM over PCIe.