How to get the compute and memory throughput of a GPU from the perspective of the whole GPU system

Hi! When I profile my CUDA program or DL inference workload, I can get the compute and memory throughput for each kernel, even when the kernels run in different processes or streams. But when I launch multiple kernels in different processes or streams, I want these metrics for the whole GPU system rather than at the kernel level. How can I do that? Thanks so much!

If you want to see activity from the entire device as multiple processes and kernels run, you may be looking for Nsight Systems. Take a look at the Nsight Systems User Guide and see if it's what you are looking for.

I can get the DRAM throughput for each kernel, but it seems that I cannot get it for the whole system, even with Nsight Systems.

If you collect GPU Metrics with Nsight Systems, you will get a row in the timeline for DRAM Bandwidth and Throughput. This is for the entire device, not for a single kernel or process. There is more information about this in the GPU Metrics section of the Nsight Systems User Guide.
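In case it helps, here is a minimal sketch of how GPU Metrics collection could be enabled from the command line, wrapped in a small Python helper. The helper name `build_nsys_command`, the report name, and `./my_app` are placeholders I made up for illustration. The flag is spelled `--gpu-metrics-device` in the Nsight Systems releases I have used, but newer releases may spell it `--gpu-metrics-devices`, so please check `nsys profile --help` for your version.

```python
import shlex


def build_nsys_command(app_cmd, output="gpu_metrics_report",
                       metrics_flag="--gpu-metrics-device",
                       devices="all", frequency_hz=10000):
    """Build an argv list for `nsys profile` with GPU Metrics sampling on.

    The flag spelling is an assumption that depends on the nsys version;
    pass metrics_flag="--gpu-metrics-devices" if your release expects that.
    """
    if not app_cmd:
        raise ValueError("app_cmd must contain the program to profile")
    return [
        "nsys", "profile",
        f"{metrics_flag}={devices}",                 # which GPUs to sample
        f"--gpu-metrics-frequency={frequency_hz}",   # sampling rate in Hz
        "-o", output,                                # report file name
        *app_cmd,                                    # program to launch
    ]


# Hypothetical application; replace with your own launch command.
cmd = build_nsys_command(["./my_app"])
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Opening the resulting report in the Nsight Systems GUI should then show the GPU Metrics rows on the timeline.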

Are you able to collect those metrics or are you looking for something else?

Got it, thanks so much!