Runtime too long with ncu, would like real-time profiling

I would like to collect measurements during 3 seconds’ worth of neural-network inference run with trtexec. The GPU measurements I would like are % utilization, % memory utilization, Tx, Rx, PCIe Tx/Rx % of peak, power, and temperature. I would also like CPU % utilization and % memory utilization, plus storage I/O Tx and Rx.

ncu -o profile ./trtexec is very slow.

How would you suggest I proceed? Any suggestions are appreciated, thank you.

Are you on embedded or x86/Power/SBSA?

Are you talking about Nsys or Ncu?

Running on x86. I’m trying to understand whether I should be using Nsys or Ncu.

I’m going to move this to the Ncu topic to get you the best feedback.

OK, I had meant to add Nsys as a topic as well, so yes, both the Ncu and Nsys tools are of interest.

Nsys will be faster, but you won’t get all of the data that you are asking for. But it might be wise to start with a first pass there to see what is going on.
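If Ncu is still needed after that first pass, one common way to cut its runtime is to profile only a handful of kernel launches instead of the whole run. Below is a minimal sketch of that idea, written as a small Python helper that assembles the command line. The skip/count values, the output name, and the trtexec arguments (including the engine file name) are illustrative assumptions, not taken from this thread; check ‘ncu --help’ for the options available in your version.

```python
import shlex


def build_ncu_command(app, app_args, skip=100, count=10, output="profile"):
    """Return an argv list that asks ncu to profile only a few kernel launches.

    skip   -- number of kernel launches to ignore before profiling starts
              (illustrative default, e.g. to step past warm-up)
    count  -- number of kernel launches to profile (illustrative default)
    output -- report name passed to -o (hypothetical name)
    """
    return [
        "ncu",
        "--launch-skip", str(skip),
        "--launch-count", str(count),
        "-o", output,
        app, *app_args,
    ]


if __name__ == "__main__":
    # "model.engine" is a hypothetical engine file used only for illustration.
    cmd = build_ncu_command("./trtexec", ["--loadEngine=model.engine"])
    print(shlex.join(cmd))
```

Profiling fewer launches reduces the number of kernel replays Ncu has to perform, which is usually where most of the extra runtime comes from.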


OK. I’m reinstalling the drivers on my server right now, so I can’t try Nsys at the moment, but I would like to get an understanding of Nsys and Ncu so I can try them over the weekend. I’m looking at the Nsys documentation now and see lots of options for trace. Do you know of a good usage example that combines Nsys and Ncu and provides as many of the metrics above as possible [GPU % utilization, % memory utilization, Tx, Rx, PCIe Tx/Rx % of peak, power and temp; CPU % utilization and % memory utilization; storage I/O Tx and Rx]?

If you have a Turing class or later GPU, you will want to try the nsys GPU metrics collection functionality. Check out the nsys CLI’s ‘nsys profile --help’ command, look for the --gpu-metrics-device switch. Collecting GPU metrics with nsys will give you a lot of the data you are looking for. Try the ‘sudo nsys profile --gpu-metrics-device= [app args]’ command to collect data.

This command will collect the GPU metrics and CPU IP/backtrace samples, and will trace CPU context switches (giving you CPU utilization and thread behavior), CUDA, OpenGL, NVTX, and OSRT (OS runtime libs), which should give you good insight into how your app is using the CPUs and GPUs on your system.
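As a concrete illustration, here is a minimal sketch of how such a command could be assembled for trtexec from a small Python helper. The device value "all", the explicit trace list, the output name, and the trtexec arguments (including the engine file name) are assumptions made for the example; run ‘nsys profile --help’ to confirm the exact switch spelling and accepted values in your version.

```python
import shlex


def build_nsys_command(app, app_args, gpu="all", output="trtexec_gpu_metrics"):
    """Return an argv list for an nsys run with GPU metrics sampling enabled.

    gpu    -- value for --gpu-metrics-device ("all" is assumed here; a
              specific GPU ID can be given instead)
    output -- report name (hypothetical)
    """
    return [
        "sudo", "nsys", "profile",
        f"--gpu-metrics-device={gpu}",
        # Spelled out explicitly for clarity; this is a subset of what the
        # post above says is collected.
        "--trace=cuda,nvtx,osrt",
        f"--output={output}",
        app, *app_args,
    ]


if __name__ == "__main__":
    # "model.engine" is a hypothetical engine file; --duration=3 asks trtexec
    # to measure for about 3 seconds, matching the original question.
    cmd = build_nsys_command(
        "./trtexec", ["--loadEngine=model.engine", "--duration=3"]
    )
    print(shlex.join(cmd))
```

Building the command as a list keeps the application arguments separate from the profiler switches, so the same helper can be reused for other engines or durations.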

OK, great to know. I am using T4, A2 and 30 GPUs, so I will try these metrics and that command line shortly. Thank you very much for the helpful suggestions!