Nsight Compute very slow when increasing kernel launch size, but works fine for smaller sizes

Basically, the kernel profiles correctly at 72x256 or even 36x256, which takes about 15 seconds, but when increasing to 144x256 or 288x256 it can take 10 minutes or more. The kernel itself is correct, so there are no errors when running standalone.
And the runtime is mostly the same in all cases, around 0.8 ms, so it is not a slow kernel.
Same issue in debug and release mode.
It gets stuck at this message:
==PROF== Profiling "ExtractIndicesByMaterial": 0%.
And when I come back 2-30 minutes later, it is done.
The repository is on Bitbucket, and the kernel in question is ExtractIndicesByMaterial.

Using the latest CUDA 11.2 and Nsight Compute 2020.3.0
Drivers 461.09

Hmm, it seems that the way kernel replay works is not friendly with lots of small arrays; by using a few big arrays instead of lots of small ones, profiling went back to being instant. Still, just because it doesn't crash doesn't mean it is not a bug. It would be nice to display a warning when kernel replay is having a stroke.
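For reference, the workaround described above can be sketched as follows. This is a hypothetical helper, not code from the repository: instead of hundreds of tiny cudaMalloc calls (each of which becomes a separate region that kernel replay has to save and restore), all the small arrays live inside one pooled allocation addressed by offsets:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Before: many tiny allocations -- each is a separate region for
// kernel replay to save and restore.
//   for (int i = 0; i < n; ++i) cudaMalloc(&ptrs[i], bytes[i]);

// After: one big allocation with per-array offsets computed on the host.
// (allocPooled is a hypothetical name for illustration.)
float* allocPooled(const std::vector<size_t>& counts,
                   std::vector<size_t>& offsets) {
    size_t total = 0;
    offsets.resize(counts.size());
    for (size_t i = 0; i < counts.size(); ++i) {
        offsets[i] = total;   // element offset where sub-array i begins
        total += counts[i];
    }
    float* pool = nullptr;
    cudaMalloc(&pool, total * sizeof(float));  // single region to replay
    return pool;              // sub-array i starts at pool + offsets[i]
}
```

With this layout, kernel replay only has to snapshot one contiguous buffer instead of thousands of scattered ones.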

Are you using the Nsight Compute command line? What are the command line options? The overhead will depend on the selected metrics, total amount of memory that needs to be restored during kernel replay and number of kernel launches profiled.

Please refer to the "Metric Collection -> Overhead" section in the Nsight Compute Kernel Profiling Guide.

I am using the GUI version. It seems that, for equal total size, a few big arrays are fine, but hundreds of small arrays degrade profiling performance by 1000x, even though the kernel itself only runs about 2x slower because of the bad memory access pattern. Even with only one metric selected, for example Speed of Light, the profiling performance is the same.

On that note is there a profiler for nsight compute? You know to profile the profiler.

We will try to reproduce the performance problem you are facing internally. My assumption is that it's related to the memory save-and-restore that is done as part of kernel replay; see the Kernel Profiling Guide in the Nsight Compute documentation.

To test this, you can try running in application replay mode, where the whole application is re-run multiple times to collect all metric data, but the GPU memory accessed by a kernel does not need to be saved and restored by the tool. The --replay-mode option is available in the CLI, and in the UI's (non-interactive) Profile activity while configuring it.
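From the CLI, switching replay modes looks like this (`myapp` and `report` are placeholder names):

```shell
# Default: kernel replay -- each kernel is re-run several times and the
# tool saves/restores the GPU memory the kernel touches between passes.
ncu --replay-mode kernel -o report ./myapp

# Application replay: the whole application is re-run once per pass
# instead, so no per-kernel memory save-and-restore is needed.
ncu --replay-mode application -o report ./myapp
```

Application replay requires the app to launch its kernels deterministically across runs, which is the trade-off for avoiding the save-and-restore cost.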

Thanks! It works fine in application replay mode; the issue occurs only in kernel replay mode. I think the simplest way to reproduce it is to allocate 100k arrays of a few bytes each and modify them in a kernel. You will see a massive slowdown regardless of the kernel's performance.
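A minimal repro along those lines might look like this. It is a sketch under the stated assumptions, not code from the repository: allocate 100k tiny device arrays, touch each one from a trivial kernel, then profile with kernel replay and compare against standalone runtime:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Write to the first element of each tiny array.
__global__ void touchArrays(float** arrays, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) arrays[i][0] += 1.0f;
}

int main() {
    const int kNumArrays = 100000;  // 100k allocations of a few bytes each
    float** hostPtrs = new float*[kNumArrays];
    for (int i = 0; i < kNumArrays; ++i)
        cudaMalloc(&hostPtrs[i], 4 * sizeof(float));

    // Copy the array of device pointers to the device.
    float** devPtrs = nullptr;
    cudaMalloc(&devPtrs, kNumArrays * sizeof(float*));
    cudaMemcpy(devPtrs, hostPtrs, kNumArrays * sizeof(float*),
               cudaMemcpyHostToDevice);

    // The kernel is trivial and fast standalone; under kernel replay,
    // profiling it should be dramatically slower because every tiny
    // allocation it touches must be saved and restored between passes.
    touchArrays<<<(kNumArrays + 255) / 256, 256>>>(devPtrs, kNumArrays);
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```

Running this once normally and once under `ncu` with the default kernel replay mode should make the slowdown obvious.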