I’m using Nsight Compute (CUDA 11.7) to profile a Fortran application. Without the profiler the job runs in 3 minutes, but under Nsight Compute it does not finish even after 3 hours. So I’m looking for options to reduce the profiling time:
(1) Profile kernels from a single process only, i.e. the process with MPI rank 0.
(2) Profile only application kernels, i.e. skip kernels such as “__pgi_dev_cumemset_4n”.
Are there such options available with Nsight Compute?
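To make the question concrete, here is a rough sketch of the kind of setup I have in mind, not a confirmed solution. It assumes Open MPI (which exports `OMPI_COMM_WORLD_RANK` to each process) and that the application kernels share a name pattern. The wrapper name `ncu_rank0.sh`, the regex `^my_app_.*`, and the report name `profile_rank0` are placeholders I made up. It relies on `ncu --kernel-name regex:<expr>` to select kernels by name and `-o` to name the report.

```shell
#!/bin/sh
# Hypothetical wrapper (ncu_rank0.sh): profile only MPI rank 0 with Nsight Compute.
# Assumption: Open MPI, which exports OMPI_COMM_WORLD_RANK to every process;
# other launchers use different variables (e.g. PMI_RANK, SLURM_PROCID).

# Assumption: application kernel names match this placeholder regex.
KERNEL_REGEX="${KERNEL_REGEX:-^my_app_.*}"

launch() {
    # With DRY_RUN=1, print the command instead of running it.
    if [ "${DRY_RUN:-0}" = "1" ]; then runner=echo; else runner=exec; fi

    if [ "${OMPI_COMM_WORLD_RANK:-0}" = "0" ]; then
        # Rank 0: profile only kernels whose names match KERNEL_REGEX.
        $runner ncu --kernel-name "regex:${KERNEL_REGEX}" -o profile_rank0 "$@"
    else
        # All other ranks: run the application without the profiler.
        $runner "$@"
    fi
}

# Launch only when an application command line was passed to the script.
if [ "$#" -gt 0 ]; then launch "$@"; fi
```

It would be started as something like `mpirun -np 4 ./ncu_rank0.sh ./my_app` (again, placeholder names), so that only rank 0 runs under `ncu` and kernels that do not match the pattern, such as “__pgi_dev_cumemset_4n”, are not profiled.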