NVTX with GPU timing?

I have read several posts and it seems that NVTX timings are traced entirely on CPU. Is it possible to have a GPU-timing version of NVTX?

For example, say I want to know how long a list of kernels takes without synchronizing the device. I could record two GPU events, one at the beginning and one at the end, and use the CUDA API to get their time difference. I would also like to visualize this in a timeline, similar to what nvtxRangePush/nvtxRangePop gives. But using NVTX naively won’t work, since the timing it records is CPU time, not GPU time.
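
To make this concrete, here is roughly what I am doing today with CUDA events from PyTorch (just a minimal sketch; the tensor size and the matmul are placeholders for my real kernels):

```python
import torch

x = torch.randn(4096, 4096, device="cuda")

# GPU-side timestamps, recorded on the current stream; no device synchronization needed
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(10):
    y = x @ x              # the list of kernels I want to time
end.record()

# Synchronize only when reading the measurement back on the host
end.synchronize()
print(f"GPU elapsed: {start.elapsed_time(end):.3f} ms")
```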

Is it possible to configure NVTX to use GPU timing? Or are there any alternative tools that are more suitable for this scenario?

Thanks in advance.

NVTX is a CPU-only API; it literally doesn’t exist on the GPU side. What Nsys does in our GUI is “project” the NVTX ranges onto the GPU: we map the CUDA kernels that were launched inside each NVTX range.

There is a way to get this projected time using our statistical analysis system; see User Guide :: Nsight Systems Documentation (that is the exact link, the forum software just truncates the text).

Would the nvtx_gpu_proj_trace report get you what you need?

Thank you for your quick reply. That report indeed gets me very close to what I need, but there are still two issues:

  1. I am not very familiar with the profiling tool chain. Is it possible to configure the Nsys GUI to display the NVTX ranges using the projected time instead of CPU time? Currently the ranges (and their elapsed times) are shown in CPU time. Clicking on a range does highlight its projected CUDA kernels, but it would be very convenient if switching were possible (basically, use the first kernel’s begin time and the last kernel’s finish time for that range).
  2. As mentioned in that doc, the report does not record kernels launched from other threads, and PyTorch appears to run the backward pass on a separate thread.

I think what I really need is an API that can construct a range containing all kernels between two given events on the same stream, something like the sketch below. That sounds simpler, but I am not quite sure how to achieve it. Any advice would be appreciated!
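
To illustrate, something along these lines is what I am imagining (a rough PyTorch sketch; gpu_timed_range is a hypothetical helper I made up, not an existing API):

```python
import torch


class gpu_timed_range:
    """Hypothetical helper: an NVTX range plus CUDA events on the current stream,
    so the region shows up in the Nsys timeline and can also report its GPU time."""

    def __init__(self, name):
        self.name = name
        self.start = torch.cuda.Event(enable_timing=True)
        self.end = torch.cuda.Event(enable_timing=True)

    def __enter__(self):
        torch.cuda.nvtx.range_push(self.name)  # visible in the Nsys timeline (CPU side)
        self.start.record()                    # GPU-side begin timestamp
        return self

    def __exit__(self, *exc):
        self.end.record()                      # GPU-side end timestamp
        torch.cuda.nvtx.range_pop()
        return False

    def elapsed_ms(self):
        self.end.synchronize()                 # sync only when the result is read back
        return self.start.elapsed_time(self.end)


x = torch.randn(2048, 2048, device="cuda")
with gpu_timed_range("forward-compute") as r:
    for _ in range(5):
        y = x @ x
print(f"forward-compute GPU time: {r.elapsed_ms():.3f} ms")
```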

You can see NVTX next to the CPU and next to the GPU as in:

However, mousing over the NVTX ranges in either row will give you the CPU time. There is no way to show you the GPU time because, well, that would be kind of misleading: we are making an estimate there.

You might be able to write your own stats script that would, for each NVTX projection, report an approximate GPU time based on the start time of the first kernel and the end time of the last kernel in that window; see the sketch below. I’m going to loop in @jkreibich, who can do a better job of helping you with that if that is the route you need to take.
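
A very rough sketch of what such a script might look like against the SQLite export (for example from nsys export --type sqlite report.nsys-rep); the table and column names used here (NVTX_EVENTS, CUPTI_ACTIVITY_KIND_RUNTIME, CUPTI_ACTIVITY_KIND_KERNEL, globalTid, correlationId) are assumptions based on a typical export and may differ between Nsight Systems versions:

```python
import sqlite3

# Assumed schema (verify against your export with ".schema" in the sqlite3 shell):
#   NVTX_EVENTS(start, end, text, globalTid)                            -- CPU-side push/pop ranges
#   CUPTI_ACTIVITY_KIND_RUNTIME(start, end, correlationId, globalTid)   -- kernel launch API calls
#   CUPTI_ACTIVITY_KIND_KERNEL(start, end, correlationId)               -- kernel executions on the GPU
con = sqlite3.connect("report.sqlite")

query = """
SELECT nvtx.text,
       MIN(k.start)  AS gpu_start,
       MAX(k."end")  AS gpu_end
FROM NVTX_EVENTS AS nvtx
JOIN CUPTI_ACTIVITY_KIND_RUNTIME AS rt
     ON rt.globalTid = nvtx.globalTid
    AND rt.start >= nvtx.start
    AND rt."end" <= nvtx."end"
JOIN CUPTI_ACTIVITY_KIND_KERNEL AS k
     ON k.correlationId = rt.correlationId
WHERE nvtx."end" IS NOT NULL               -- keep push/pop ranges, skip marks
GROUP BY nvtx.start, nvtx."end", nvtx.text
ORDER BY gpu_start
"""

for name, gpu_start, gpu_end in con.execute(query):
    # Timestamps are in nanoseconds; report the projected GPU duration of the range.
    print(f"{name}: {(gpu_end - gpu_start) / 1e6:.3f} ms")
```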


If I’m understanding this correctly, it sounds like you’re really trying to understand the relationship between kernel launches and executions, and are using NVTX ranges to put time markers around those events.

I don’t think we have any better solution than the NVTX-based things Holly has already mentioned, but we do have a different report that might help. The cuda_kern_exec_trace stats report (in the GUI, I believe it is called CUDA Kernel Launch & Exec Time Trace) might tell you what you want to know. You can read the full help text (nsys stats --help-report cuda_kern_exec_trace), but the quick description is that this report links each CPU-side CUDA API call that launches a kernel with the GPU runtime information for that kernel. The report shows the launch time, the execution time, and the queue time between the two.

There is also a summary version of this report (CUDA Kernel Launch & Exec Time Summary, cuda_kern_exec_sum), which will group results by kernel. That might help if you’re looking for general trends.

My main concern with these reports is that, on the CPU side of things, the recorded time for the CUDA API call will only include the execution time of the CUDA C API call, and not the cost of any Python wrappers. It sounds like you might want the Python time, rather than the time of the C execution.


Thank you both for your detailed replies. The NVTX row next to the GPU looks very nice and appears to be what I am looking for. There’s one last bit I would like to confirm: in my trace, the time shown next to the range appears to be the GPU time, i.e., the elapsed time from the start of the first enclosed kernel to the end of the last one:

As you can see, my hand-selected duration is 5.395 ms, which, after discounting my hand-selection error, should be the same as the 5.352 ms shown on that forward-compute range. Am I misunderstanding what you mean by “CPU time” and “GPU time”?

If you hover over a particular NVTX range, it will show you the start time and stop time of that range. If you hover on the CPU NVTX bubble for that range, the tooltip shows the exact same time as what you see if you hover over the GPU NVTX bubble for that range, even though technically (although it isn’t too obvious here) there could be a delay before the CUDA kernels started.

In my case they are actually very different:


hmm…okay.

@jkreibich would the sqlite export give that time?

I mean, the data is in there somewhere. I’m not sure how the GUI calculates those numbers, however, or what the difference might be. We don’t currently have any stats reports that correlate CUDA API calls to kernels and also incorporate NVTX data; you can get “NVTX projected to kernels” or “API to kernels,” but I’m pretty sure we don’t have a report that ties it all together.