How can I profile both kernel and CUDA API hardware usage, plus total application duration?

I want to measure the duration of an application using Nsight Systems or Nsight Compute, not just a single kernel but the total application workload. I already know that NCU profiles only single kernels and also serializes the kernels in the application. So how can I get the exact runtime of the whole application?
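For whole-application timing, Nsight Systems is the usual tool. A minimal sketch of the CLI workflow (the report name and application path are placeholders, not from the thread):

```shell
# Trace the whole application run with Nsight Systems
nsys profile -o myreport ./my_app

# Summarize the trace: CUDA API call durations, kernel times,
# memory transfer statistics, and overall timeline spans
nsys stats myreport.nsys-rep
```

Because Nsight Systems traces rather than replays, the reported durations reflect the application's real concurrent execution, unlike NCU's serialized kernel replay.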

I am also curious whether there is a way for Nsight Compute to profile the hardware usage of CUDA API calls, not just kernels.
For example, is there any way to profile hardware usage, such as DRAM reads or writes, caused by a cudaMemcpy API call?

I also want to know whether 'range replay' can measure total duration or the hardware usage of CUDA APIs.
If so, to profile the total performance or metrics of the application, do I just insert cudaProfilerStart() at the top of the code and cudaProfilerStop() at the bottom?
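For reference, the profiler API bracketing described above looks like the sketch below. The kernel and buffer sizes are placeholders; note that the correct end call is cudaProfilerStop(), as there is no cudaProfilerEnd() in the CUDA runtime:

```cuda
// Sketch: bracketing a region of interest with the CUDA profiler API.
#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) {  // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f;
}

int main() {
    float *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));

    cudaProfilerStart();               // begin the profiled range
    myKernel<<<32, 32>>>(d_data);
    cudaDeviceSynchronize();
    cudaProfilerStop();                // end the profiled range

    cudaFree(d_data);
    return 0;
}
```

With range replay, such a range can then be profiled with something like `ncu --replay-mode range ./my_app`; with Nsight Systems, only the bracketed region is captured when the session is started with capture controlled by the profiler API.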

Hi, @gju06051

Thanks for using our tools! Please check whether the API Statistics view can meet your requirement: 3. Nsight Compute — NsightCompute 12.5 documentation

Thank you for your comment. But why is this capability not supported in the Nsight Compute CLI?

I am also curious whether this API duration value is the same as the one reported by Nsight Systems. The documentation you linked says these API Statistics results in Nsight Compute cannot replace Nsight Systems.

Hi, @gju06051

API Statistics is supported in the interactive profile activity. It is not supported in the Nsight Compute CLI.

Yes, as the documentation says: "Note that this view cannot be used as a replacement for Nsight Systems when trying to optimize CPU performance of your application".

For tool selection, you can refer to https://developer.nvidia.com/tools-overview
