Takes days to profile my code

Hi all,

So, for the application I’m profiling, we have a few poorly designed kernels which matter very little to application performance, and are launched many, many times at the beginning of the application with a small number of threads. While this isn’t ideal for kernels, let’s pretend for the moment that this is a constant. Someone may say this part should be re-engineered to not be this way, but this lets us share much more calculation setup code with the CPU version of our program. This setup part of the calculation takes about one minute to run, if not profiling.

Anyway, nsight compute takes a huge amount of time to get through these initial parts, even though I’m only profiling a kernel that runs much later in the application. Why is nsight compute executing my application at a much slower pace, even in the parts that I’m not profiling? Is there a way around this? I wish it would only start to slow the application down once it starts profiling the kernel of interest.

As another detail, I’m doing a regex match on the demangled kernel name. Maybe it’s faster to match against a specific kernel ID or something?

Looking forward to some advice.

EDIT: my code does use a multitude of separately allocated arrays, and the slowdown is likely related to what’s observed in this post.

Can you please confirm the ncu version you are using (by posting the output of ncu --version)?

Why is nsight compute executing my application at a much slower pace, even in the parts that I’m not profiling?

Such a slowdown is not expected for parts of the application which are not being profiled.

As another detail, I’m doing a regex match on the demangled kernel name. Maybe it’s faster to match against a specific kernel ID or something?

Using a regex match should be fine. It should not result in a slowdown.
Even with the regex match, the number of kernels profiled could be large. You can try to limit the number of kernels profiled by additionally using the ncu --launch-count option.
You can also limit the number of metrics collected by using the --metrics or --section option.
Refer to the Metric Collection->Overhead section in the Nsight Compute Kernel Profiling Guide.
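As a rough sketch of how these options might be combined on one command line: the application path, kernel pattern, and report name below are placeholders (assumptions, not taken from this thread), and SpeedOfLight is just one example of a section identifier. Depending on the ncu version, the regex filter may be spelled --kernel-regex instead of --kernel-name regex:, so please check ncu --help for your install.

```shell
#!/usr/bin/env bash
# Placeholders (assumptions, not from the thread): application path,
# kernel-name pattern, and report name.
app="./my_app"
kernel_regex="my_late_kernel"
report="late_kernel"

# Build the ncu invocation as an array so the quoting stays intact.
ncu_cmd=(
  ncu
  --kernel-name "regex:${kernel_regex}"   # some older releases: --kernel-regex <pattern>
  --launch-count 5                        # stop collecting after 5 matching launches
  --section SpeedOfLight                  # collect one section instead of the default set
  -o "${report}"                          # write the results to a report file
  "${app}"
)

# Show the command that would be run.
printf '%s\n' "${ncu_cmd[*]}"

# Only invoke the profiler when both ncu and the application are present.
if command -v ncu >/dev/null 2>&1 && [ -x "${app}" ]; then
  "${ncu_cmd[@]}"
fi
```

Note that these options only reduce the cost of the launches that are actually profiled. They do not change how the rest of the application runs under the tool.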

OK, thanks for the info. Here’s my version.

NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2021 NVIDIA Corporation
Version 2020.3.1.0 (build 29471205) (public-release)

Indeed, my regex only matches the kernel I wish to profile. I’m seeing massive slowdown of my code well before the program launches the kernel of interest. On top of that, I indeed set a limit on the number of launches to profile. I have the program set to only launch five times, which takes about a day to run. So far, I have been unable to produce a minimal working example to reproduce this.

I should reiterate that when not run under the profiler, the code is maybe about 1000 times faster, and indeed gets correct results. I’ve run it through cuda-memcheck, and it’s not doing anything funky there either.

Maybe it will help to add this; I see roughly the same slowdown when launching the program under cuda-gdb for the kernels not being profiled.

My earlier statement “Such a slowdown is not expected for parts of the application which are not being profiled.” is not accurate.
Nsight Compute serializes all kernels and API calls (refer to the serialization section in the Nsight Compute Kernel Profiling Guide). So the application will run slower, especially if it’s multi-threaded/multi-streamed, or if there are many small kernels.

How many kernels are launched? When you say it “takes about a day to run” - is this without profiling?
As suggested earlier, did you try the following?

  1. To limit the number of kernels profiled by additionally using the ncu --launch-count option.
  2. To limit the number of metrics collected by using the option --metrics or --section.

Thanks Sanjiv! Indeed, my application runs many small kernels up front which are not performance intensive. This was slowing down the profile dramatically. I recently updated CUDA, and the application seems to profile much faster now. There must have been some small change made somewhere that really helped my app out during profiles.

Good to know that you are now able to profile and the issue is resolved.
