How can I get the exact CPU and GPU time in NSYS NVTX profiling?

I just ran a profile of LLAMA2 to identify potential bottlenecks. How do I get the exact CPU-only execution times and GPU-only execution times from the NSYS profiling results?

What is the difference between the NVTX ranges shown in the CUDA HW section and the ones inside the Python thread? Are they the same?

Is the duration for each NVTX range inclusive of the CPU and GPU execution time?

In the CUDA API section, does cudaMalloc, for example, take into account execution time on the GPU?
Does it wait for the GPU to finish?
Does this give the GPU execution time for the operation?

I’m going to try to answer as many of these as I can.

Firstly, there are statistical analysis scripts available in the GUI. If you go to the drop-down shown as “Events View” in your screenshot, you will have a statistical analysis option. Using the existing scripts there, you should be able to get what you want. If not, you can modify those scripts as well.

Secondly, NVTX is a CPU-side API. What you see on the thread row is the actual time that the NVTX range was open on the CPU. The NVTX ranges you see on the GPU timeline are a projection: they are a graphical representation of which GPU work was launched while that range was open. Therefore the duration is inclusive of CPU time on the CPU side and of GPU time on the GPU side (although the GPU projection is less precise).
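To make that concrete, here is a minimal sketch (assuming PyTorch and a CUDA device; the range name "matmul_step" is just an illustration). The push/pop calls run on the CPU, so the range you see on the Python thread row is the CPU wall time between them; on the CUDA HW row, Nsight Systems projects the range onto the GPU work that was launched while it was open.

```python
import torch

x = torch.randn(4096, 4096, device="cuda")

torch.cuda.nvtx.range_push("matmul_step")   # CPU-side marker: the range opens here
y = x @ x                                    # kernel is launched (asynchronously) inside the range
torch.cuda.nvtx.range_pop()                  # CPU-side marker: the range closes here

# The kernel may still be running on the GPU after range_pop() returns,
# which is why the projected range on the GPU timeline can extend past
# the CPU-side range.
torch.cuda.synchronize()
```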

The CUDA API ranges on the CPU show the time the CPU spent executing that call. Whether it waits for the GPU to finish depends on the API in question. The GPU time for the underlying kernels that get executed is better seen on the GPU timeline.
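As a rough illustration of why the CPU-side API range usually does not cover GPU execution, here is a sketch (again assuming PyTorch; the timings are illustrative only). A kernel launch returns to the CPU almost immediately, while an explicit synchronize blocks the CPU until the GPU has finished:

```python
import time
import torch

x = torch.randn(8192, 8192, device="cuda")
_ = x @ x                      # warm-up so cuBLAS init doesn't skew the numbers
torch.cuda.synchronize()

# Without synchronization: measures only the CPU cost of queueing the work,
# roughly what the launch range in the CUDA API row shows.
t0 = time.perf_counter()
y = x @ x
t1 = time.perf_counter()

# With synchronization: the CPU blocks until the GPU finishes, so the elapsed
# time now also covers GPU execution.
t2 = time.perf_counter()
y = x @ x
torch.cuda.synchronize()
t3 = time.perf_counter()

print(f"launch only: {(t1 - t0) * 1e3:.3f} ms, launch + GPU: {(t3 - t2) * 1e3:.3f} ms")
```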

Thanks for the quick reply. I have a few more questions as I will be using these for my dissertation.

To get the exact GPU time, can we trace back all the kernel functions called inside a specific NVTX range listed in “CUDA Kernel Launch & Exec Time Trace” and sum up “Kernel Dur” or “Total Dur” values?

Does this give us the exact GPU execution time for a specific NVTX range?
If my understanding is correct, all the remaining time indicates that the GPU was idle, i.e. GPU idle time?

Is the “Duration” mentioned in the ‘CUDA GPU Trace’ inclusive of the CPU and GPU or is it only GPU execution time?

Also, how do we find idle GPU times and CPU-only execution time? Does the ‘NVTX GPU Projection Trace’ help?

If you change the drop-down in the Events View to “Expert Systems”, you will find CPU and GPU starvation rules. These will allow you to find places where the relevant hardware is idle. You can use the settings to change the length of idle time that triggers a finding there.

The CUDA GPU trace knows nothing about CPU time.
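If you do want to total up the kernel time launched under a given NVTX range yourself, one approach is to export the report to SQLite (`nsys export --type sqlite report.nsys-rep`) and sum it there. The sketch below assumes the table and column names I have seen in recent Nsight Systems SQLite schemas (NVTX_EVENTS, CUPTI_ACTIVITY_KIND_RUNTIME, CUPTI_ACTIVITY_KIND_KERNEL); the range name and file path are placeholders, and older versions may use different names.

```python
import sqlite3

RANGE_NAME = "matmul_step"              # hypothetical NVTX range name
con = sqlite3.connect("report.sqlite")  # produced by: nsys export --type sqlite report.nsys-rep

# Join each kernel to the runtime launch call that issued it (correlationId),
# and keep only launches made on the same thread while the NVTX range was open.
# Note: NVTX_EVENTS.text can be NULL for registered strings, and kernels that
# overlap on different streams are summed, not unioned.
query = """
SELECT SUM(k.end - k.start) AS gpu_ns
FROM NVTX_EVENTS AS r
JOIN CUPTI_ACTIVITY_KIND_RUNTIME AS api
  ON api.globalTid = r.globalTid
 AND api.start BETWEEN r.start AND r.end
JOIN CUPTI_ACTIVITY_KIND_KERNEL AS k
  ON k.correlationId = api.correlationId
WHERE r.text = ?
"""
gpu_ns = con.execute(query, (RANGE_NAME,)).fetchone()[0] or 0
print(f"GPU kernel time inside '{RANGE_NAME}': {gpu_ns / 1e6:.3f} ms")  # timestamps are in ns
con.close()
```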

I think you might want to read this blog I wrote - https://developer.nvidia.com/blog/understanding-the-visualization-of-overhead-and-latency-in-nsight-systems/
