Option to profile only master process

Hello,

I’m using Nsight Compute (CUDA 11.7) to profile a Fortran application. The job normally runs in 3 minutes, but under the Nsight Compute profiler it doesn’t finish even after 4 hours. So I’m looking for options to reduce the profiling time:
(1) Profile kernels from a single process, i.e. the process with MPI rank 0.
(2) Profile only application-specific kernels, i.e. skip kernels such as “__pgi_dev_cumemset_4n”.

Are there any such options available with Nsight Compute?

You can find further information on these topics here and here in the documentation.

The Nsight Systems (nsys) profiler output shows that the following kernel takes the largest share of the GPU kernel time (17.2%):

[7/9] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)                                                   Name    
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  ------------  ----------------------------------------------------------------------------------------------------
     17.2   35,060,287,713        392   89,439,509.5   73,738,924.0   36,020,468  274,459,150  31,305,026.5  void cutlass::Kernel<cutlass_80_tensorop_d884gemm_128x64_16x3_nn_align1>(T1::Params)

But the same kernel cannot be profiled with Nsight Compute (ncu) when I filter by this kind of name:

export s_ncuprof_cmd=" ncu -o report.ncu.out --target-processes all -f -k cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x64_16x4_tn_align1>(T1::Params) "
It says:

FORTRAN STOP
==PROF== Disconnected from process 79447
==WARNING== No kernels were profiled.

Please suggest how to profile, with ncu, a kernel as it is listed in the nsys output.

Thanks

You have to specify the --kernel-name-base option to tell ncu which version of the kernel/function name it should consider for matching the filter. In your case, it would be --kernel-name-base demangled. Note that you can furthermore use regular expressions with the regex: prefix (as clarified in the docs) with -k.
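As a rough sketch of how these two options combine: by default ncu matches the filter against the plain function name (here just "Kernel"), so a filter containing the template arguments only matches once the base is switched to demangled. The helper below only prints the command so it can be inspected first. The function name build_ncu_cmd and the binary ./my_app are hypothetical placeholders, not part of ncu.

```shell
# Print (rather than run) an ncu command that matches kernels by demangled name.
# build_ncu_cmd and ./my_app are illustrative placeholders.
build_ncu_cmd() {
    pattern=$1
    shift
    cmd="ncu --target-processes all -f -o report.ncu.out"
    cmd="$cmd --kernel-name-base demangled --kernel-name regex:${pattern}"
    printf '%s\n' "$cmd $*"
}

# The regex only needs to match part of the demangled name.
build_ncu_cmd 'cutlass_80_tensorop_d884gemm' ./my_app
```

If the pattern contains shell metacharacters such as < or (, quote the regex: argument when running the printed command.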

Now, with these ncu options:

ncu --target-processes all -o report.ncu.tpall.out.%h.%p -f --kernel-name regex:cutlass_80_tensorop --kernel-name-base demangled

it fails with the following error:

==ERROR== UnknownError
==ERROR== Failed to profile kernel "Kernel" in process 61987
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
srun: error: ak0038: task 20: Exited with exit code 9
srun: error: ak0040: task 39: Exited with exit code 9
==PROF== Profiling "void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x64_16x4_tn_align1>(T1::Params)": 0%

It’s an MPI code, and srun (Slurm) is used to run the job on the nodes. However, it works if only the rank 0 process is profiled:

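# Wrapper launched by srun: only MPI rank 0 runs under ncu.
if [[ ${SLURM_PROCID} -eq 0 ]]; then
    ncu --target-processes=all -o report.ncu.tpall.out.%h.%p -f --kernel-name regex:cutlass_80_tensorop --kernel-name-base demangled "$@"
else
    # Assumption: the remaining ranks run the application unprofiled.
    "$@"
fi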

Is there an option that needs to be used for multi-process profiling?

In general, we recommend profiling a single rank per node for MPI applications. By default, Nsight Compute serializes kernels within a process, replays them, and needs a global lock on the profiling hardware. Because of this, multi-process applications that communicate between processes, such as MPI applications, can often hang or fail when multiple ranks are profiled in parallel.
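One way to get a single profiled rank per node is to key the wrapper on the node-local task ID instead of the global rank. This is a minimal sketch, assuming the job is launched with srun so that Slurm sets SLURM_LOCALID for every task. The function name ncu_prefix is a hypothetical helper, and the ncu options are the ones used earlier in this thread.

```shell
# Print the ncu prefix only for the first task on each node; other tasks get
# an empty prefix and therefore run unprofiled.
# SLURM_LOCALID is the node-local task ID set by Slurm for each task.
ncu_prefix() {
    local_id=${1:-${SLURM_LOCALID:-0}}
    if [ "$local_id" -eq 0 ]; then
        printf '%s\n' "ncu --target-processes all -f -o report.ncu.%h.%p --kernel-name-base demangled --kernel-name regex:cutlass_80_tensorop"
    fi
}

# Demonstration with explicit local IDs.
echo "local rank 0: $(ncu_prefix 0)"
echo "local rank 1: $(ncu_prefix 1)"

# In a wrapper script started by srun, the last line could be:
#   exec $(ncu_prefix) "$@"
```

The command substitution in the commented exec line is left unquoted on purpose, so the printed prefix is split into separate arguments, and an empty prefix simply runs the application directly.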

Ok Thanks.

Without the ncu profiler, the job’s run time is 3 minutes. With ncu, even when profiling a single kernel [ncu --target-processes=all -o report.ncu.tpall.out.%h.%p -f --kernel-name regex:cutlass_80_tensorop --kernel-name-base demangled], it doesn’t complete even in 4 hours. Please suggest which options to use to reduce the profiling time while still getting the general characteristics of the code’s GPU utilization (I’m not necessarily looking to improve performance). Thanks

It’s likely that you’re profiling many instances of the same kernel. One thing to try is profiling a subset using the --launch-count flag. You can optionally add the --launch-skip and --launch-skip-before-match flags. See the Nsight Compute CLI section of the Nsight Compute documentation for more details.
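As an illustration, the nsys summary earlier in this thread reports 392 instances of one cutlass kernel, so profiling only a handful of them should cut the run time considerably. The sketch below prints a command that skips some launches and then profiles a few. The values 10 and 5, the helper name launch_window, and ./my_app are arbitrary placeholders. As far as I understand, --launch-skip counts only launches that match the kernel filter.

```shell
# Print ncu flags that skip the first SKIP matching launches, then profile COUNT.
# launch_window is an illustrative helper, not an ncu feature.
launch_window() {
    skip=$1
    count=$2
    printf '%s\n' "--launch-skip ${skip} --launch-count ${count}"
}

# Illustrative values: skip 10 warm-up launches, then profile the next 5.
echo "ncu $(launch_window 10 5) --kernel-name-base demangled --kernel-name regex:cutlass_80_tensorop ./my_app"
```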

Thanks.

Our objective in profiling this Fortran application is to get general characteristics of the code, such as:

  • Time split between CPU and GPU.
  • Summary of time spent in MPI and in data transfers between host and device (in both directions).
  • Utilization (%) of each GPU.
  • Memory utilization, etc.

Which tool do you suggest for getting these details: (1) nsys or (2) ncu?

Nsight Systems is going to give you more of this type of information than Nsight Compute.
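A possible starting point, sketched under a few assumptions: the application is traced for CUDA and MPI activity, and the summaries are then printed with nsys stats. The report name cuda_gpu_kern_sum is the one shown earlier in this thread. cuda_gpu_mem_time_sum is assumed to be available in your nsys version, so check the list of reports your installation supports. The helper names, ./my_app, and the report file name are placeholders.

```shell
# Print an nsys collection command that traces CUDA and MPI calls.
# $1 = output name (placeholder), $2 = application command (placeholder).
nsys_collect_cmd() {
    printf '%s\n' "nsys profile --trace=cuda,mpi --stats=true -o $1 $2"
}

# Print an nsys stats command for one summary report.
# $1 = report name, $2 = report file written by the collection step.
nsys_stats_cmd() {
    printf '%s\n' "nsys stats --report $1 $2"
}

nsys_collect_cmd report.nsys.%h.%p ./my_app
nsys_stats_cmd cuda_gpu_kern_sum report.nsys-rep      # per-kernel GPU time
nsys_stats_cmd cuda_gpu_mem_time_sum report.nsys-rep  # host/device transfer time
```

Unlike ncu, nsys does not replay kernels, so it is normally practical to run it on every rank of the job.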

The imported output from a Nsight Compute profiler report shows some of the values as “nan”:

 void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x32_16x4_tn_align1>(T1::Params), 2023-Sep-06 01:39:44, Context 1, Stream 7
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                            cycle/second                        (!) nan
    SM Frequency                                                              cycle/second                        (!) nan
    Elapsed Cycles                                                                   cycle                        (!) nan
    Memory [%]                                                                           %                        (!) nan
    DRAM Throughput                                                                      %                        (!) nan
    Duration                                                                       usecond                         345.06
    L1/TEX Cache Throughput                                                              %                        (!) nan
    L2 Cache Throughput                                                                  %                        (!) nan
    SM Active Cycles                                                                 cycle                        (!) nan
    Compute (SM) [%]                                                                     %                        (!) nan
    ---------------------------------------------------------------------- --------------- ------------------------------

A single kernel, “cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x32_16x4_tn_align1>(T1::Params)”, is profiled using only 1 process (MPI rank 0). The job uses 12 ranks/node, 4 GPUs/node, and 4 nodes in total.

Why are some values “nan” here?

Can you attach the Nsight Compute report to this issue? The Nsight Compute release that ships with CUDA 11.7 is very old. Are you able to try the newest version?