Nsight Graphics 2019.6.0: heavy CPU overhead on GL draw thread after Nsight app launch

Hi,

I’m running Nsight Graphics 2019.6.0 on 64-bit Windows 7 Embedded with 441.20 NVIDIA drivers. My GPU is RTX 2080 Ti.

My app is based on the OpenGL 4.6.0 compatibility profile.

I launch my app as follows:

c:\program files\NVIDIA Corporation\Nsight Graphics 2019.6.0\host\windows-desktop-nomad-x64\nv-nsight-launcher.exe --activity "Frame Profiler" --exe --args

My app launches successfully; Nsight says “Launch succeeded.” on the command line.

+++

My problem is that the CPU thread that feeds OpenGL draw commands runs considerably slower if I start my app this way, and the app's frame rate drops sharply. As part of my frame loop, I measure the CPU-side time that passes between the start and end of a frame's draw commands. The results:

Nsight OFF 6.7ms / ON 32.9ms.
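
For reference, here is a minimal sketch of the kind of CPU-side measurement I mean. It is not my actual app code: `measureDrawSubmitMs` and the `issueDrawCommands` callback are hypothetical names, and I am assuming `std::chrono::steady_clock` as the time source. It only times how long the CPU spends issuing the calls, not how long the GPU takes to execute them.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Returns the wall-clock time, in milliseconds, that the CPU spends inside
// issueDrawCommands (a hypothetical stand-in for one frame's GL calls).
double measureDrawSubmitMs(const std::function<void()>& issueDrawCommands) {
    assert(static_cast<bool>(issueDrawCommands));  // callback must be set

    using Clock = std::chrono::steady_clock;       // monotonic clock
    const Clock::time_point start = Clock::now();
    issueDrawCommands();                           // glDraw*, state changes, ...
    const Clock::time_point end = Clock::now();

    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Called once per frame around the draw submission, this yields the kind of per-frame figure quoted above.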

So launching via the Nsight launcher has a significant impact on the performance of the app.

(Now, if I open up the Nsight Graphics tool, I can see my app issues ~77000 events per frame which is a lot. Just so you know. Also, there are some compatibility warnings, but the tool seems to work. At least the biggest problem seems to be performance at the moment.)

Anyway, I decided to run the AMD CodeAnalyst profiler on my app after an Nsight launch (with Nsight Graphics tool not running). The result is that most of the time is spent in Nvda.Graphics.Interception.dll. Inside that module the call with the biggest sample count by a large margin is PathUtils::GetApplicationName. Not quite the call you’d expect!

If I launch some other app, for example, FurMark, using the Nsight launcher, it only has a small impact on FurMark’s framerate (Nsight OFF 134 fps, ON 132 fps). Likewise, if I profile it running with CodeAnalyst, GetApplicationName does not even appear in the results.

So, what could be going on? In my app’s case, is the Interception.dll not getting my application’s name and retrying indefinitely?

What does it need the application name for?

Any idea what could be going on? GetApplicationName could be a good hint.

(My app is not easily reduced to a minimum testset, because it’s a real-time rendering product that’s been in development for years.)

Best regards,

Jani

GetApplicationName probably shows up in the profiler output only because it's the nearest known export symbol. The sampled code offsets relative to GetApplicationName are on the order of 5 megabytes, so the samples almost certainly belong to other, unnamed functions in the module.

So, maybe it’s my large event count times the NV interception overhead, then. Still, if we compare event counts, my extra overhead does not seem to be in line with the overhead observed in the case of FurMark. My app probably finds an extra slow path inside the interceptor.
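
A rough back-of-the-envelope sketch of that comparison, using only the numbers quoted in this thread. The helper names are made up for illustration. Note that the FurMark figures are whole-frame times derived from fps rather than CPU submit times, and I don't know FurMark's event count, so only the per-frame overhead can be compared directly.

```cpp
#include <cassert>

// Extra time per frame (ms) when launched under the tool.
double addedFrameMs(double offMs, double onMs) { return onMs - offMs; }

// Convert a frame rate to a frame time in milliseconds.
double frameMsFromFps(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Average added cost per API event, in microseconds.
double addedUsPerEvent(double addedMs, double eventsPerFrame) {
    assert(eventsPerFrame > 0.0);
    return addedMs * 1000.0 / eventsPerFrame;
}
```

With my numbers, `addedFrameMs(6.7, 32.9)` is about 26.2 ms per frame, and `addedUsPerEvent(26.2, 77000)` is about 0.34 microseconds per event. For FurMark, `addedFrameMs(frameMsFromFps(134), frameMsFromFps(132))` is only about 0.11 ms per frame.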

Jani

After a “Capture for Live Analysis”, Nsight Graphics’ API Statistics View reports the following API calls as consuming the most CPU time:

Count   API call              Total CPU (ms)
1832    glDrawRangeElements   15.1
1844    glDrawArrays          14.79
34      ....                  1.76
... the rest have minor contributions ...

For the scene in question, the normal total CPU time per frame is 6 to 7 ms, so 15.1 + 14.8 ms (about 29.9 ms) for these two calls alone is out of this world.
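
To put those totals in per-call terms, here is a small sketch that divides the reported total by the call count. `avgUsPerCall` is just an illustrative helper, and the inputs are the figures from the API Statistics View above.

```cpp
#include <cassert>

// Average CPU time per call in microseconds, given a call count and the
// total CPU time in milliseconds reported for that call.
double avgUsPerCall(int callCount, double totalCpuMs) {
    assert(callCount > 0);
    return totalCpuMs * 1000.0 / callCount;
}
```

That works out to roughly 8.2 microseconds per glDrawRangeElements call (`avgUsPerCall(1832, 15.1)`) and roughly 8.0 microseconds per glDrawArrays call (`avgUsPerCall(1844, 14.79)`).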

Hello,

I'm sorry you ran into this issue with Nsight Graphics. I will file an internal bug so that our engineering team can investigate the significant performance impact you see when launching your app via the Nsight launcher. We will get back to you with any questions we may have.

Regards,

Darrell

@The_Scytheman sorry for the delay in following up on this topic.

Running through a CPU profiler with Nsight will indicate the module’s cost but the symbol attribution won’t be correct, as you said.

The API statistics view will be showing the driver time of the operations that you are executing – that is separate from the overhead that Nsight introduces.

As a debugger, Nsight is expected to introduce some overhead, but what you are seeing is beyond what we try to achieve. It is challenging to determine where exactly the cost is going without a full profiler run, which would require access to your app, but one thing that might provide value is to share a C++ capture of your application. Assuming that it builds and runs correctly, which is not necessarily a given with the compatibility warnings, we can infer the frametime cost from running it. Is this something you can collect and share? Feel free to contact me privately to make arrangements.