The kernel profiles fine at 72x256 or even 36x256, where profiling takes about 15 seconds. When I increase the size to 144x256 or 288x256, profiling can take 10 minutes or more. The kernel itself is correct and runs without errors standalone.
The kernel runtime is about the same for all of these sizes, around 0.8 ms, so it is not a slow kernel.
The same issue occurs in both debug and release mode.
It gets stuck at this message:
==PROF== Profiling "ExtractIndicesByMaterial": 0%.
When I come back 2 to 30 minutes later, it is done.
This is the repository: https://bitbucket.org/emelrad12/shaders
The kernel in question is ExtractIndicesByMaterial.
I am using the latest CUDA 11.2 and Nsight Compute 2020.3.0.
Driver version: 461.09