So, for the application I’m profiling, we have a few poorly designed kernels that matter very little to overall performance but are launched many, many times at the start of the application, each with a small number of threads. This launch pattern isn’t ideal, but let’s treat it as a constant for now: someone might say this part should be re-engineered, but it lets us share much more of the calculation setup code with the CPU version of our program. This setup phase takes about one minute to run when not profiling.
Anyway, Nsight Compute takes a huge amount of time to get through this initial phase, even though I’m only profiling a kernel that runs much later in the application. Why does Nsight Compute slow my application down even in the parts I’m not profiling, and is there a way around this? Ideally it would only slow the application down once it starts profiling the kernel of interest.
One more detail: I’m selecting the kernel with a regex match on its demangled name. Would matching against a specific kernel ID or something similar be faster?
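For concreteness, my invocation looks roughly like the following sketch (the kernel name, binary, and launch count are placeholders, not the real ones):

```shell
# Select the kernel of interest by a regex on its demangled name.
# "myKernel" and "./app" are placeholders for the real names.
ncu --kernel-name regex:myKernel --launch-count 1 -o report ./app
```
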
Looking forward to some advice.
EDIT: my code does use a multitude of separately allocated arrays, and the slowdown is likely related to what’s observed in this post.