NVTX with GPU timing?

I have read several posts and it seems that NVTX timings are traced entirely on CPU. Is it possible to have a GPU-timing version of NVTX?

For example, say I want to know how long a list of kernels takes without synchronizing the device. I could record two GPU events, one at the beginning and one at the end, and use the CUDA API to get their time difference. I would also like to visualize this in a timeline, similar to what nvtxRangePush/nvtxRangePop gives. But using NVTX naively won’t work, since the timing it records is CPU time, not GPU time.
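
To make this concrete, here is roughly what I am doing today with CUDA events from PyTorch (just a minimal sketch; the tensor size and the matmul are placeholders for my real kernels):

```python
import torch

x = torch.randn(4096, 4096, device="cuda")

# GPU-side timestamps, recorded on the current stream; no device synchronization needed
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(10):
    y = x @ x              # the list of kernels I want to time
end.record()

# Synchronize only when reading the measurement back on the host
end.synchronize()
print(f"GPU elapsed: {start.elapsed_time(end):.3f} ms")
```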

Is it possible to configure NVTX to use GPU timing? Or are there any alternative tools that are more suitable for this scenario?

Thanks in advance.

NVTX is a CPU-only API; it literally doesn’t exist on the GPU side. What Nsys does in our GUI is “project” the NVTX ranges onto the GPU: we map the CUDA kernels that were launched inside each NVTX range.

There is a way to get this projected time using our statistical analysis system; see User Guide :: Nsight Systems Documentation (that is the exact link, the forum software just truncates the text).

Would the nvtx_gpu_proj_trace report get you what you need?

Thank you for your quick reply. That report indeed gets me very close to what I need, but there are still two issues:

  1. I am not very familiar with the profiling tool chain. Is it possible to configure the Nsys GUI to display the NVTX ranges using the projected time instead of CPU time? Currently the ranges (and their elapsed times) are shown in CPU time. Clicking on a range does highlight its projected CUDA kernels, but it would be very convenient if switching were possible (basically, use the first kernel’s begin time and the last kernel’s finish time for that range).
  2. As mentioned in that doc, the report does not record kernels launched from other threads, and PyTorch appears to run the backward pass on a separate thread.

I think what I really need is an API that can construct a range containing all kernels between two given events on the same stream, something like the sketch below. That sounds simpler, but I am not quite sure how to achieve it. Any advice would be appreciated!
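
To illustrate, something along these lines is what I am imagining (a rough PyTorch sketch; gpu_timed_range is a hypothetical helper I made up, not an existing API):

```python
import torch


class gpu_timed_range:
    """Hypothetical helper: an NVTX range plus CUDA events on the current stream,
    so the region shows up in the Nsys timeline and can also report its GPU time."""

    def __init__(self, name):
        self.name = name
        self.start = torch.cuda.Event(enable_timing=True)
        self.end = torch.cuda.Event(enable_timing=True)

    def __enter__(self):
        torch.cuda.nvtx.range_push(self.name)  # visible in the Nsys timeline (CPU side)
        self.start.record()                    # GPU-side begin timestamp
        return self

    def __exit__(self, *exc):
        self.end.record()                      # GPU-side end timestamp
        torch.cuda.nvtx.range_pop()
        return False

    def elapsed_ms(self):
        self.end.synchronize()                 # sync only when the result is read back
        return self.start.elapsed_time(self.end)


x = torch.randn(2048, 2048, device="cuda")
with gpu_timed_range("forward-compute") as r:
    for _ in range(5):
        y = x @ x
print(f"forward-compute GPU time: {r.elapsed_ms():.3f} ms")
```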

You can see NVTX next to the CPU and next to the GPU as in:

However, mousing over the NVTX ranges in either row will give you the CPU time. There is no way to show you the GPU time because, well, that would be kind of misleading: we are making an estimate there.

You might be able to write your own stats script that would, for each NVTX projection, report an approximate GPU time based on the start time of the first kernel and the end time of the last kernel in that window; see the sketch below. I’m going to loop in @jkreibich, who can do a better job of helping you with that if that is the route you need to take.
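
A very rough sketch of what such a script might look like against the SQLite export (for example from nsys export --type sqlite report.nsys-rep); the table and column names used here (NVTX_EVENTS, CUPTI_ACTIVITY_KIND_RUNTIME, CUPTI_ACTIVITY_KIND_KERNEL, globalTid, correlationId) are assumptions based on a typical export and may differ between Nsight Systems versions:

```python
import sqlite3

# Assumed schema (verify against your export with ".schema" in the sqlite3 shell):
#   NVTX_EVENTS(start, end, text, globalTid)                            -- CPU-side push/pop ranges
#   CUPTI_ACTIVITY_KIND_RUNTIME(start, end, correlationId, globalTid)   -- kernel launch API calls
#   CUPTI_ACTIVITY_KIND_KERNEL(start, end, correlationId)               -- kernel executions on the GPU
con = sqlite3.connect("report.sqlite")

query = """
SELECT nvtx.text,
       MIN(k.start)  AS gpu_start,
       MAX(k."end")  AS gpu_end
FROM NVTX_EVENTS AS nvtx
JOIN CUPTI_ACTIVITY_KIND_RUNTIME AS rt
     ON rt.globalTid = nvtx.globalTid
    AND rt.start >= nvtx.start
    AND rt."end" <= nvtx."end"
JOIN CUPTI_ACTIVITY_KIND_KERNEL AS k
     ON k.correlationId = rt.correlationId
WHERE nvtx."end" IS NOT NULL               -- keep push/pop ranges, skip marks
GROUP BY nvtx.start, nvtx."end", nvtx.text
ORDER BY gpu_start
"""

for name, gpu_start, gpu_end in con.execute(query):
    # Timestamps are in nanoseconds; report the projected GPU duration of the range.
    print(f"{name}: {(gpu_end - gpu_start) / 1e6:.3f} ms")
```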


If I’m understanding this correctly, it sounds like you’re really trying to understand the relationship between kernel launches and executions, and are using NVTX ranges to put time markers around those events.

I don’t think we have any better solution than the NVTX-based things Holly has already mentioned, but we do have a different report that might help. The cuda_kern_exec_trace stats report (in the GUI, I believe it is called CUDA Kernel Launch & Exec Time Trace) might tell you what you want to know. You can read the full help text (nsys stats --help-report cuda_kern_exec_trace), but the quick description is that this report links each CPU-side CUDA API call that launches a kernel with the GPU runtime information for that kernel. The report shows the launch time, the execution time, and the queue time between the two.

There is also a summary version of this report (CUDA Kernel Launch & Exec Time Summary, cuda_kern_exec_sum), which will group results by kernel. That might help if you’re looking for general trends.

My main concern with these reports is that, on the CPU side of things, the recorded time for the CUDA API call will only include the execution time of the CUDA C API call, and not the cost of any Python wrappers. It sounds like you might want the Python time, rather than the time of the C execution.


Thank you both for your detailed replies. The NVTX row next to the GPU looks very nice and appears to be what I am looking for. There’s one last bit I would like to confirm: in my trace, the time shown next to the range appears to be the GPU time, i.e., the elapsed time from the start of the first enclosed kernel to the end of the last one:

As you can see, my hand-selected duration is 5.395 ms, which, after discounting my hand-selection error, should be the same as the 5.352 ms shown on that forward-compute range. Am I misunderstanding what you mean by “CPU time” and “GPU time”?

If you hover over a particular NVTX range, it will show you the start time and stop time of that range. If you hover on the CPU NVTX bubble for that range, the tooltip shows the exact same time as what you see if you hover over the GPU NVTX bubble for that range, even though technically (although it isn’t too obvious here) there could be a delay before the CUDA kernels started.

In my case they are actually very different:


hmm…okay.

@jkreibich would the sqlite export give that time?

I mean, the data is in there somewhere. I’m not sure how the GUI calculates those numbers, however, or what the difference might be. We don’t currently have any stats reports that correlate CUDA API calls to kernels and also incorporate NVTX data; you can get “NVTX projected to kernels” or “API to kernels,” but I’m pretty sure we don’t have a report that ties it all together.