Takes days to profile my code

gavin.keith.ridley · March 22, 2021, 4:21pm

Hi all,

So, for the application I’m profiling, we have a few poorly designed kernels which matter very little to application performance, and are launched many, many times at the beginning of the application with a small number of threads. While this isn’t ideal for kernels, let’s pretend for the moment that this is a constant. Someone may say this part should be re-engineered to not be this way, but this lets us share much more calculation setup code with the CPU version of our program. This setup part of the calculation takes about one minute to run, if not profiling.

Anyway, nsight compute takes a huge amount of time to get through these initial parts, even though I’m only profiling a kernel that runs much later in the application. Why is nsight compute executing my application at a much slower pace, even in the parts that I’m not profiling? Is there a way around this? I wish it would only start to slow the application down once it starts profiling the kernel of interest.

As another detail, I’m doing a regex match on the demangled kernel name. Maybe it’s faster to match against a specific kernel ID or something?

Looking forward to some advice.

EDIT: my code does use a multitude of separately allocated arrays, and the slowdown is likely related to what’s observed in this post.

Sanjiv.Satoor · April 16, 2021, 1:03pm

Can you please confirm the ncu version you are using (by posting the output of ncu --version)?

“Why is nsight compute executing my application at a much slower pace, even in the parts that I’m not profiling?”

Such a slowdown is not expected for parts of the application which are not being profiled.

“As another detail, I’m doing a regex match on the demangled kernel name. Maybe it’s faster to match against a specific kernel ID or something?”

Using a regex match should be fine; it should not cause a slowdown.
Even with the regex match, the number of kernels profiled could be large. You can limit it further with the ncu --launch-count option.
You can also limit the number of metrics collected with the --metrics or --section option.
Refer to the Metric Collection->Overhead section in the Nsight Compute Kernel Profiling Guide.
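
As a rough sketch of how these options might be combined on one command line: the application path, kernel name pattern, and report name below are placeholders and not taken from this thread, and the option spellings assume a recent ncu release (older releases spell the kernel filter differently, so check ncu --help for your version).

```shell
#!/bin/sh
# Placeholders (assumptions): replace with your own application and kernel name pattern.
APP="./my_app"
KERNEL="regex:my_late_kernel"

# --kernel-name  : only profile kernels whose name matches the pattern
# --launch-count : stop profiling after this many matching launches
# --section      : collect a single section instead of the full default set
NCU_CMD="ncu --kernel-name $KERNEL --launch-count 5 --section SpeedOfLight -o report $APP"

# Print the command so it can be reviewed before running it.
echo "$NCU_CMD"
```

Note that these options only reduce the cost of the kernels that are actually profiled; they do not change how the rest of the application behaves under the tool.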

gavin.keith.ridley · April 16, 2021, 2:28pm

OK, thanks for the info. Here’s my version.

NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2021 NVIDIA Corporation
Version 2020.3.1.0 (build 29471205) (public-release)

Indeed, my regex only matches the kernel I wish to profile. I’m seeing massive slowdown of my code well before the program launches the kernel of interest. On top of that, I indeed set a limit on the number of launches to profile. I have the program set to only launch five times, which takes about a day to run. So far, I have been unable to produce a minimal working example to reproduce this.

I should reiterate that when not run under the profiler, the code is roughly 1000 times faster and gets correct results. I’ve run it through cuda-memcheck, and it’s not doing anything funky there either.

Maybe it will help to add this: I see roughly the same slowdown for the kernels not being profiled when launching the program under cuda-gdb.

Sanjiv.Satoor · April 19, 2021, 5:24am

My earlier statement “Such a slowdown is not expected for parts of the application which are not being profiled.” is not accurate.
Nsight Compute serializes all kernels and API calls (refer to the Serialization section in the Nsight Compute Kernel Profiling Guide). So the application will run slower, especially if it’s multi-threaded/multi-streamed, or if there are many small kernels.

Sanjiv.Satoor · April 19, 2021, 5:37am

How many kernels are launched? When you say it “takes about a day to run” - is this without profiling?
As suggested earlier, did you try the following?

Limiting the number of kernels profiled by additionally using the ncu --launch-count option.
Limiting the number of metrics collected by using the --metrics or --section option.

gavin.keith.ridley · April 26, 2021, 6:35pm

Thanks Sanjiv! Indeed, my application runs many small kernels up front which are not performance intensive. This was slowing down the profile dramatically. I recently updated CUDA, and the application seems to profile much faster now. There must have been some small change made somewhere that really helped my app out during profiles.

Sanjiv.Satoor · April 27, 2021, 6:16am

Good to know that you are now able to profile and the issue is resolved.
