Is the profiling session duration equivalent to total runtime when using Nsight Systems?

I am a student working on a project that involves CPU and GPU profiling using NVIDIA Nsight Systems. I’m trying to wrap my head around all of the different visuals, options, and statistics. My main question is: is the profiling session duration (the full timeline view) equivalent to the total runtime of the program? I can see that both CPUs and GPUs were profiled, including initializations for both. I’m comparing two runs, hoping to pick up differences between them, and it’s rather difficult with the level of experience I have thus far. Adjustments were made to the GPU side only. Any help would be greatly appreciated. Thank you!

Yeah, the disadvantage of making a tool with a lot of features is that there is a lot to try to visually parse through.

I am going to recommend that you use the gpu-kernel-summary stats script to get information about all the kernels that ran in each run.
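For reference, that summary can be generated from the command line with `nsys stats`. The report name below (`cuda_gpu_kern_sum`) is what recent Nsight Systems versions call the GPU kernel summary; older releases used different names, so check `nsys stats --help-reports` on your version. The report filenames are placeholders for your two runs.

```shell
# Generate the GPU kernel summary from each existing report file
# (report name may differ by nsys version; see `nsys stats --help-reports`).
nsys stats --report cuda_gpu_kern_sum report_run1.nsys-rep
nsys stats --report cuda_gpu_kern_sum report_run2.nsys-rep
```

Comparing the two resulting tables kernel by kernel (instance counts, average duration, total time) is usually much easier than eyeballing two timelines.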

May I ask what the “adjustments” were between runs?

I don’t know if you are aware, but you can also open multiple reports in the same timeline, or in panes in the tool; see the User Guide (nsight-systems 2024.2 documentation) for details. (That is a direct link to the section; the forum software just munges the text.)

Oh, and as to the title question: presuming you are using the options that launch the application with nsys, do not have a delay before collection, and end collection at the end of the run (rather than after a fixed duration of a few seconds), then yes, the length of the profile should equal the wall-clock time of the application running under the profiler.
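Concretely, a CLI invocation along these lines (the application name is a placeholder) profiles from launch to exit, so the session duration matches the run’s wall-clock time; the delay and duration options are what would break that equivalence:

```shell
# Full-run profile: session duration ~= application wall-clock time.
nsys profile -o full_run ./myapp

# These options decouple session duration from total runtime:
#   --delay=5      start collecting 5 seconds after launch
#   --duration=10  stop collecting after 10 seconds
nsys profile --delay=5 --duration=10 -o partial_run ./myapp
```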

I first used Nsight Compute to narrow down the GPU kernels that might be causing bottlenecks, and one kernel stood out as a likely culprit. I saw a few sections in the code that could be improved a bit (reducing redundant if statements and breaking up multiple blocks), so I broke it up into 4 separate parallel blocks, 2 of which ran concurrently. I was also able to adjust 2 other kernels to run concurrently that weren’t before.

I ran everything on my local machine, which has an NVIDIA RTX 3080, and saw nominal improvements with the adjustments. Now I am running on an NCSA Delta allocation on NVIDIA A100s, and from the looks of it the GPU runtime wasn’t really improved, but the overall profiling session duration (overall runtime?) was reduced by 2 seconds. I ran a larger test case and confirmed a similar reduction in profiling session duration, around 3 seconds (8 seconds to 5 seconds).

I’m a bit disappointed that there seems to be no improvement on the GPU side, while also trying to figure out why the CPU has seemingly improved. Also trying to understand everything I’m looking at within the Systems reports in general, haha. It’s a lot!
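(For readers following along: the kind of kernel concurrency described above comes from launching independent kernels into separate non-default streams. This is only an illustrative sketch with placeholder kernel names, not the poster’s actual code, and it needs a CUDA-capable machine to run.)

```cuda
#include <cuda_runtime.h>

// Placeholder kernels standing in for two independent pieces of work.
__global__ void partA(float *x, int n) { /* ... */ }
__global__ void partB(float *y, int n) { /* ... */ }

void launchConcurrent(float *dX, float *dY, int n) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Kernels in different streams have no implicit ordering, so the
    // GPU scheduler may run them concurrently if resources allow.
    partA<<<(n + 255) / 256, 256, 0, s1>>>(dX, n);
    partB<<<(n + 255) / 256, 256, 0, s2>>>(dY, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```

In the Nsight Systems timeline, concurrent execution shows up as the two kernels’ rows overlapping in time on the CUDA hardware lanes.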

(as an aside, does that mean you are in Champaign-Urbana? My husband works at NCSA)

Okay, we usually recommend going the other way: starting with Nsight Systems to evaluate what is going on across the entire system, and determining whether your issue is with things like memory transfer overhead or GPU starvation, before you try to dive down into individual kernels. I’m going to point you at a blog I wrote which also makes a good primer on a bunch of the timeline features.

Without looking at the results, I am guessing here, but it can certainly happen that you have the same amount of actual GPU work to do, but that by optimizing the kernel on the CPU side, you improved the packaging and transfer of the work to the GPU.
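As a hedged illustration of what “improving the packaging and transfer of the work” can look like in practice (everything here is a placeholder sketch, not the poster’s code, and needs a GPU to run): double-buffered pinned host memory lets the CPU fill one staging buffer while the other is still being copied to the device, so the same amount of GPU work gets fed to the GPU faster.

```cuda
#include <cuda_runtime.h>

// Placeholder sketch: overlap host-to-device transfer of one chunk
// with CPU preparation of the next, using two pinned buffers and
// two streams.
void sendChunks(float *dBuf, size_t chunkElems, int nChunks) {
    float *hBuf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMallocHost(&hBuf[i], chunkElems * sizeof(float)); // pinned
        cudaStreamCreate(&stream[i]);
    }

    for (int c = 0; c < nChunks; ++c) {
        int b = c % 2;
        // Wait until the previous copy from this buffer has finished
        // before the CPU overwrites it.
        cudaStreamSynchronize(stream[b]);
        // ... CPU fills hBuf[b] with chunk c here ...
        cudaMemcpyAsync(dBuf + (size_t)c * chunkElems, hBuf[b],
                        chunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(hBuf[i]);
    }
}
```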

I’m actually a student at University of California, Irvine! The allocation is through an internship.

Regarding Systems / Compute - that’s funny, because most of the videos and tutorials I went through suggested Compute first to analyze individual kernels for bottlenecks, haha. In any case, I’m still learning all the lingo (GPU starvation, etc.), so I don’t think it mattered which route I chose 😅. I’ll definitely read through your blog post! Thank you for sharing that.

Is it the case that adjustments to the CUDA portion of the code can result in CPU improvements? I can see a significant reduction in the CUDA profiling initialization between the two runs (and in the larger test case). What could cause this? I actually expected more overhead because we split the kernel into 4, but maybe the added concurrency played a role as well?

We are working on finishing up an OpenACC implementation as a final optimization to the code base and I’m crossing my fingers it will improve the GPU efficiency more than what I’m seeing from the adjustments that have been made thus far.

On another note - is it common for runtime to vary significantly depending on hardware? For example, the RTX showed improvement and the A100 did not.
I think my main takeaway here is to reserve celebrations until after running on different hardware, haha. The RTX runs gave me false confidence. 😅

So the functions you write in your C++ code as “CUDA code” tell the CUDA compiler what can be run on the GPU, but there is still some CPU work that has to be done there. So technically, yes.
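A minimal sketch of that CPU-side cost (the kernel name is hypothetical; this needs a GPU to run): a `<<<...>>>` launch is itself host code, a runtime call that enqueues work and returns, so trimming launch overhead, argument setup, and synchronization shows up as CPU-side improvement.

```cuda
#include <cuda_runtime.h>

__global__ void work(float *x) { /* hypothetical GPU work */ }

int main() {
    float *d;
    cudaMalloc(&d, 1024 * sizeof(float));

    // The launch is a CPU-side runtime call: it enqueues the kernel
    // and returns immediately, without waiting for the GPU.
    work<<<4, 256>>>(d);

    // The CPU is free to do other work here while the GPU executes.

    // Only this call blocks the CPU until the GPU finishes.
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```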

And absolutely, different GPUs can and will have different performance, mostly because there will be different optimal sizes for memory transfers and different optimal work sizes. So the fixes you made to clean up the CPU code save you everywhere, and the big changes you made to improve the GPU performance remain useful, but there will often need to be a bit of tuning between generations.
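One way to see why the same launch configuration behaves differently on an RTX 3080 versus an A100 is to query the hardware properties that tuning usually keys off (device 0 assumed; needs a CUDA-capable machine):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the properties that typically drive per-GPU tuning:
// SM count, shared memory per block, and warp size all vary
// between a consumer 3080 and a data-center A100.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %d SMs, %zu bytes shared mem/block, warp size %d\n",
           prop.name, prop.multiProcessorCount,
           prop.sharedMemPerBlock, prop.warpSize);
    return 0;
}
```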

Getting the best overall runtime requires keeping both the CPU and the GPU busy.

What exactly does this mean?

Interesting. I’ll share this information with my team when we discuss the runs. Have you used OpenACC before?

I have not personally, no. I have heard good things about it.