Is the profiling session duration equivalent to total runtime when using Nsight Systems?

I am a student working on a project that involves CPU and GPU profiling using NVIDIA Nsight Systems. I’m trying to wrap my head around all of the different visuals, options, and statistics. My main question is: is the profiling session duration (the full timeline view) equivalent to the total runtime of the program? I can see that both CPUs and GPUs were profiled, including initializations for both. I’m comparing two runs, hoping to pick up differences between them, and it’s rather difficult with the level of experience I have thus far. Adjustments were made to the GPU side only. Any help would be greatly appreciated. Thank you!

Yeah, the disadvantage of making a tool with a lot of features is that there is a lot to try to visually parse through.

I am going to recommend that you use the gpu-kernel-summary stats script to get information about all the kernels that ran in each run.
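For reference, that summary can be generated from the command line with `nsys stats`. The report name below (`cuda_gpu_kern_sum`) is what recent Nsight Systems versions call the GPU kernel summary; older releases used different names, so check `nsys stats --help-reports` on your version. The report filenames are placeholders for your two runs.

```shell
# Generate the GPU kernel summary from each existing report file
# (report name may differ by nsys version; see `nsys stats --help-reports`).
nsys stats --report cuda_gpu_kern_sum report_run1.nsys-rep
nsys stats --report cuda_gpu_kern_sum report_run2.nsys-rep
```

Comparing the two resulting tables kernel by kernel (instance counts, average duration, total time) is usually much easier than eyeballing two timelines.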

May I ask what the “adjustments” were between runs?

I don’t know if you are aware, but you can also open multiple reports in the same timeline, or in panes in the tool; see the User Guide (nsight-systems 2024.2 documentation) for details. (That is a direct link to the section; the forum software just munges the text.)

Oh, and as to the title question: presuming you are using the options that launch the application with nsys, do not have a delay before collection, and end collection at the end of the run (rather than after a fixed duration of a few seconds), then yes, the length of the profile should equal the wall-clock time of the application running under the profiler.
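Concretely, a CLI invocation along these lines (the application name is a placeholder) profiles from launch to exit, so the session duration matches the run’s wall-clock time; the delay and duration options are what would break that equivalence:

```shell
# Full-run profile: session duration ~= application wall-clock time.
nsys profile -o full_run ./myapp

# These options decouple session duration from total runtime:
#   --delay=5      start collecting 5 seconds after launch
#   --duration=10  stop collecting after 10 seconds
nsys profile --delay=5 --duration=10 -o partial_run ./myapp
```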

I first used Nsight Compute to narrow down the GPU kernels that might be causing bottlenecks, and one kernel stood out as a likely culprit. I saw a few sections in the code that could be improved a bit (reducing redundant if statements and breaking up multiple blocks), so I broke it up into 4 separate parallel blocks, 2 of which ran concurrently. I was also able to adjust 2 other kernels to run concurrently that weren’t before.

I ran everything on my local machine, which has an NVIDIA RTX 3080, and saw nominal improvements with the adjustments. Now I am running on an NCSA Delta allocation on NVIDIA A100s, and from the looks of it the GPU runtime wasn’t really improved, but the overall profiling session duration (overall runtime?) was reduced by 2 seconds. I ran a larger test case and confirmed a similar reduction in profiling session duration, around 3 seconds (8 seconds to 5 seconds).

I’m a bit disappointed that there seems to be no improvement on the GPU side, while also trying to figure out why the CPU has seemingly improved. Also trying to understand everything I’m looking at within the Systems reports in general, haha. It’s a lot!
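(For readers following along: the kind of kernel concurrency described above comes from launching independent kernels into separate non-default streams. This is only an illustrative sketch with placeholder kernel names, not the poster’s actual code, and it needs a CUDA-capable machine to run.)

```cuda
#include <cuda_runtime.h>

// Placeholder kernels standing in for two independent pieces of work.
__global__ void partA(float *x, int n) { /* ... */ }
__global__ void partB(float *y, int n) { /* ... */ }

void launchConcurrent(float *dX, float *dY, int n) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Kernels in different streams have no implicit ordering, so the
    // GPU scheduler may run them concurrently if resources allow.
    partA<<<(n + 255) / 256, 256, 0, s1>>>(dX, n);
    partB<<<(n + 255) / 256, 256, 0, s2>>>(dY, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```

In the Nsight Systems timeline, concurrent execution shows up as the two kernels’ rows overlapping in time on the CUDA hardware lanes.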

(as an aside, does that mean you are in Champaign-Urbana? My husband works at NCSA)

Okay, we usually recommend going the other way: starting with Nsight Systems to evaluate what is going on across the entire system, and determining whether your issue is with things like memory transfer overhead or GPU starvation, before you try to dive down into individual kernels. I’m going to point you at a blog I wrote which also makes a good primer on a bunch of the timeline features.

Without looking at the results, I am guessing here, but it can certainly happen that you have the same amount of actual GPU work to do, but that by optimizing the kernel on the CPU side, you improved the packaging and transfer of the work to the GPU.
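As a hedged illustration of what “improving the packaging and transfer of the work” can look like in practice (everything here is a placeholder sketch, not the poster’s code, and needs a GPU to run): double-buffered pinned host memory lets the CPU fill one staging buffer while the other is still being copied to the device, so the same amount of GPU work gets fed to the GPU faster.

```cuda
#include <cuda_runtime.h>

// Placeholder sketch: overlap host-to-device transfer of one chunk
// with CPU preparation of the next, using two pinned buffers and
// two streams.
void sendChunks(float *dBuf, size_t chunkElems, int nChunks) {
    float *hBuf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMallocHost(&hBuf[i], chunkElems * sizeof(float)); // pinned
        cudaStreamCreate(&stream[i]);
    }

    for (int c = 0; c < nChunks; ++c) {
        int b = c % 2;
        // Wait until the previous copy from this buffer has finished
        // before the CPU overwrites it.
        cudaStreamSynchronize(stream[b]);
        // ... CPU fills hBuf[b] with chunk c here ...
        cudaMemcpyAsync(dBuf + (size_t)c * chunkElems, hBuf[b],
                        chunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(hBuf[i]);
    }
}
```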

I’m actually a student at University of California, Irvine! The allocation is through an internship.

Regarding Systems / Compute - that’s funny, because most of the videos and tutorials I went through suggested Compute first to analyze individual kernels for bottlenecks, haha. In any case, I’m still learning all the lingo (GPU starvation, etc.), so I don’t think it mattered which route I chose 😅. I’ll definitely read through your blog post! Thank you for sharing that.

Is it the case that adjustments to the CUDA portion of the code can result in CPU improvements? I can see a significant reduction in the CUDA profiling initialization between the two runs (and in the larger test case). What could cause this? I actually expected more overhead because we split the kernel into 4, but maybe the added concurrency played a role as well?

We are working on finishing up an OpenACC implementation as a final optimization to the code base and I’m crossing my fingers it will improve the GPU efficiency more than what I’m seeing from the adjustments that have been made thus far.

On another note - is it common for runtime to vary significantly depending on hardware? For example, the RTX showed improvement and the A100 did not.
I think my main takeaway here is to reserve celebrations until after running on different hardware, haha. The RTX runs gave me false confidence. 😅

So the functions you write in your C++ code as “CUDA code” tell the CUDA compiler what can be run on the GPU, but there is still some CPU work that has to be done there. So technically, yes.
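A minimal sketch of that CPU-side cost (the kernel name is hypothetical; this needs a GPU to run): a `<<<...>>>` launch is itself host code, a runtime call that enqueues work and returns, so trimming launch overhead, argument setup, and synchronization shows up as CPU-side improvement.

```cuda
#include <cuda_runtime.h>

__global__ void work(float *x) { /* hypothetical GPU work */ }

int main() {
    float *d;
    cudaMalloc(&d, 1024 * sizeof(float));

    // The launch is a CPU-side runtime call: it enqueues the kernel
    // and returns immediately, without waiting for the GPU.
    work<<<4, 256>>>(d);

    // The CPU is free to do other work here while the GPU executes.

    // Only this call blocks the CPU until the GPU finishes.
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```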

And absolutely, different GPUs can and will have different performance, mostly because there will be different optimal sizes for memory transfers and different optimal work sizes. So the fixes you made to clean up the CPU code save you everywhere, and the big changes you made to improve the GPU performance remain useful, but there will often need to be a bit of tuning between generations.
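One way to see why the same launch configuration behaves differently on an RTX 3080 versus an A100 is to query the hardware properties that tuning usually keys off (device 0 assumed; needs a CUDA-capable machine):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the properties that typically drive per-GPU tuning:
// SM count, shared memory per block, and warp size all vary
// between a consumer 3080 and a data-center A100.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %d SMs, %zu bytes shared mem/block, warp size %d\n",
           prop.name, prop.multiProcessorCount,
           prop.sharedMemPerBlock, prop.warpSize);
    return 0;
}
```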

Getting the best overall runtime requires keeping both the CPU and the GPU busy.

What exactly does this mean?

Interesting. I’ll share this information with my team when we discuss the runs. Have you used OpenACC before?

I have not personally, no. I have heard good things about it.