Nsight Systems VPI operation times not accurate

Hello,

I have a VPI pipeline with a few operations. One of the last ones is a median filter. I am profiling it using nsight systems.

Timing my pipeline with std::chrono (starting when I submit the first operation and stopping after the streamSync returns), I measure 5.6 milliseconds with the median filter and 5.5 milliseconds without it. This led me to conclude the median filter itself takes ~0.1 milliseconds.

Here are the timelines from Nsight Systems:
With median filter:

Without median filter:

Looking at these timelines, it appears that a big chunk (2.8 ms) of my pipeline is taken up by the median filter, and if I remove it, that same chunk is attributed to the unlock task.

However, it doesn’t make sense that the median filter is taking 2.8 ms, since removing it shortens my pipeline by only 0.1 ms.

Why is there this discrepancy? Is the median filter actually taking 2.8 ms, or is Nsight Systems wrong? If it really takes 2.8 ms, why does removing it shorten the pipeline by only 0.1 ms? Is there some optimization I’m missing?

It is a little hard to tell from screenshots, but it looks to me like any time you gain by removing the median filter is completely hidden by the time spent in the cudaStreamSynchronize call.

That would explain the numbers: the cudaStreamSynchronize call takes a fixed ~3 ms, so the total time would be the same whether or not the median filter is run.

However, it doesn’t make sense in other ways: 1) why would a 3x3 median filter take 7x longer than stereo block matching, and 2) what is the stream synchronize doing for 3 ms, and why is that time not affected by a 2.8 ms median filter on the same stream?

I expanded the CUDA HW view and I can see SGM calls there both with and without the median filter. This suggests the SGM is actually running during that time and takes much longer than the StereoDisparity range in the VPI row indicates. I guess I should be looking at the CUDA HW row? But what is the meaning of the VPI row in that case?

FYI, if you click on an event on the CPU side, we will show a projection of all of the GPU-side events launched by that range. (In other words, we can do automatically what you drew by hand.)

I am not familiar with VPI in particular; however, the stream synchronize cannot return until all the work launched on that stream is complete. The CPU is waiting for the GPU to finish. So the sync function is called and starts on the nsys timeline, but then it waits for the ~3 ms of GPU work to finish, and only then ends on the timeline.

(For more info about how we visualize latency and overhead, see “Understanding the Visualization of Overhead and Latency in NVIDIA Nsight Systems” on the NVIDIA Technical Blog.)

Thank you, the article was very helpful. I’m still not quite sure what the VPI row stands for, but the rest of Nsight Systems is enough to understand what’s actually going on.

VPI is a library, correct? I assume that VPI was instrumented with NVTX and given an NVTX domain labeled “VPI”. When nsys sees that, we put the corresponding API calls in a row with the NVTX domain name.

Yes, VPI is a closed-source NVIDIA library. I had assumed there was some special integration for VPI in Nsight Systems; I didn’t know NVTX let you customize the UI like that.

Yup, it works for NVTX annotations you add yourself as well, and it is definitely our preferred way of adding profiling hooks to internal libraries. NVTX is available on GitHub as a simple C header, with C++/Python/Fortran wrappers, and both Nsight Systems and Nsight Compute will recognize the domain names and reflect them in the GUI.
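To illustrate, instrumenting your own code with a named NVTX domain (so your ranges get their own timeline row, the way VPI’s do) looks roughly like this, assuming the header-only NVTX v3 from github.com/NVIDIA/NVTX is on the include path; the domain and range names here are made up for the example:

```cpp
#include <nvtx3/nvToolsExt.h>  // header-only NVTX v3

int main() {
    // Create a named domain; Nsight Systems shows it as its own timeline row.
    nvtxDomainHandle_t domain = nvtxDomainCreateA("MyPipeline");

    nvtxEventAttributes_t attr = {};
    attr.version     = NVTX_VERSION;
    attr.size        = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
    attr.messageType = NVTX_MESSAGE_TYPE_ASCII;
    attr.message.ascii = "MedianFilter";

    nvtxDomainRangePushEx(domain, &attr);
    // ... the work being measured goes here ...
    nvtxDomainRangePop(domain);

    nvtxDomainDestroy(domain);
    return 0;
}
```

Note that, as discussed above, such a range only brackets the CPU-side span; any asynchronous GPU work it launches still shows up separately in the CUDA HW rows.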
