Option to profile only master process

Hello,

I’m working with the Nsight Compute version from CUDA 11.7 to profile a Fortran application. The job’s runtime is 3 minutes, but under the Nsight Compute profiler it doesn’t finish even in 4 hours. So I’m looking for options to reduce the profiling time:
(1) profile kernels from a single process only, i.e. the process with MPI rank 0;
(2) profile only application-specific kernels, i.e. skip kernels such as “__pgi_dev_cumemset_4n”.

Are there any such options available with Nsight Compute?

You can find further information on both of these topics in the documentation.

The Nsight Systems (nsys) profiler output shows that the following kernel consumes most of the runtime:

[7/9] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)                                                   Name    
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  ------------  ----------------------------------------------------------------------------------------------------
     17.2   35,060,287,713        392   89,439,509.5   73,738,924.0   36,020,468  274,459,150  31,305,026.5  void cutlass::Kernel<cutlass_80_tensorop_d884gemm_128x64_16x3_nn_align1>(T1::Params)

But the same kernel name cannot be profiled with Nsight Compute (ncu):

export s_ncuprof_cmd=" ncu -o report.ncu.out --target-processes all -f -k cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x64_16x4_tn_align1>(T1::Params) "
It says:

FORTRAN STOP
==PROF== Disconnected from process 79447
==WARNING== No kernels were profiled.

Please suggest how to profile a kernel listed in the nsys output with the ncu profiler.

Thanks

You have to specify the --kernel-name-base option to tell ncu which version of the kernel/function name it should consider when matching the filter. In your case, it would be --kernel-name-base demangled. Note that you can also use regular expressions with -k via the regex: prefix (as described in the docs).
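For example, a minimal sketch using the demangled name from the nsys output above (the application name and arguments are placeholders):

ncu -o report.ncu.out -f --target-processes all --kernel-name-base demangled -k regex:cutlass_80_tensorop ./app <args>

Here regex:cutlass_80_tensorop matches any demangled kernel name containing that substring, so you don’t have to spell out the full template signature.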

Now with these NCU options

ncu --target-processes all -o report.ncu.tpall.out.%h.%p -f --kernel-name regex:cutlass_80_tensorop --kernel-name-base demangled

It fails with the following error:

==ERROR== UnknownError
==ERROR== Failed to profile kernel "Kernel" in process 61987
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
srun: error: ak0038: task 20: Exited with exit code 9
srun: error: ak0040: task 39: Exited with exit code 9
==PROF== Profiling "void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x64_16x4_tn_align1>(T1::Params)": 0%

It’s an MPI code, and srun (Slurm) is used to run the job on the nodes. However, it works if only the rank 0 process is profiled:

if [[ ${SLURM_PROCID} -eq 0 ]]; then
        ncu --target-processes=all -o report.ncu.tpall.out.%h.%p -f --kernel-name regex:cutlass_80_tensorop --kernel-name-base demangled "$@"
else
        "$@"
fi

Is there any option that needs to be used for multi-process profiling?

In general, we recommend profiling a single rank per node for MPI applications. Nsight Compute, by default, serializes kernels within a process, replays them, and needs a global lock on the profiling hardware. Because of this, multi-process applications with inter-rank communication, like MPI applications, can often hang or fail when trying to profile multiple ranks in parallel.
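If you still want coverage of every node, one common pattern is a wrapper script that profiles only the first rank on each node. A minimal sketch, assuming a Slurm launcher (SLURM_LOCALID is the node-local task ID):

#!/bin/bash
# Profile only the first task on each node; every other rank runs the app unmodified.
if [[ ${SLURM_LOCALID} -eq 0 ]]; then
        ncu -o report.%h.%p -f "$@"
else
        "$@"
fi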

Ok Thanks.

Without the NCU profiler, the job’s runtime is 3 minutes. But with NCU and single-kernel profiling [ncu --target-processes=all -o report.ncu.tpall.out.%h.%p -f --kernel-name regex:cutlass_80_tensorop --kernel-name-base demangled], it doesn’t complete even in 4 hours. Please suggest which options to use to reduce the profiling time while still getting general characteristics of the code’s GPU utilization [I’m not necessarily looking to improve performance]. Thanks

It’s likely that you’re profiling many instances of the same kernel. One thing to try is profiling only a subset of the launches using the --launch-count flag. You can optionally add the --launch-skip and --launch-skip-before-match flags. See Nsight Compute CLI :: Nsight Compute Documentation for more details.
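For example, building on your earlier command (a sketch; the skip/count values are placeholders to tune):

ncu --target-processes all -o report.ncu.out -f --kernel-name regex:cutlass_80_tensorop --kernel-name-base demangled --launch-skip 10 --launch-count 5 ./app

--launch-skip skips the given number of kernel launches before profiling starts, and --launch-count stops profiling once that many kernels have been profiled, so not all 392 instances from your nsys output get replayed.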

Thanks.

Our objective in profiling this Fortran application is to get general characteristics of the code, such as:

  • Time split between CPU and GPU.
  • Summary of time spent in MPI and in data transfers between host and device (and vice versa).
  • Utilization of each GPU (%).
  • Memory utilization, etc.

To get these details, which tool do you suggest: (1) nsys or (2) ncu?

Nsight Systems is going to give you more of this type of information than Nsight Compute.
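For example, a sketch of an nsys invocation covering those categories (adjust the trace domains to what your build supports):

srun <srun-args> nsys profile --trace=cuda,mpi,nvtx --stats=true -o report.%q{SLURM_PROCID} ./pw.x -in pw.in

The cuda trace captures kernel and host/device memcpy times, the mpi trace summarizes time spent in MPI calls, and --stats=true prints summary tables like the cuda_gpu_kern_sum report you quoted earlier.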

The imported output from an Nsight Compute profiler report shows some of the values as “nan”:

 void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x32_16x4_tn_align1>(T1::Params), 2023-Sep-06 01:39:44, Context 1, Stream 7
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                            cycle/second                        (!) nan
    SM Frequency                                                              cycle/second                        (!) nan
    Elapsed Cycles                                                                   cycle                        (!) nan
    Memory [%]                                                                           %                        (!) nan
    DRAM Throughput                                                                      %                        (!) nan
    Duration                                                                       usecond                         345.06
    L1/TEX Cache Throughput                                                              %                        (!) nan
    L2 Cache Throughput                                                                  %                        (!) nan
    SM Active Cycles                                                                 cycle                        (!) nan
    Compute (SM) [%]                                                                     %                        (!) nan
    ---------------------------------------------------------------------- --------------- ------------------------------

A single kernel, “cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x32_16x4_tn_align1>(T1::Params)”, is profiled using only one process [the MPI rank 0 process]. The job runs 12 ranks/node and 4 GPUs/node on 4 nodes in total.

Why are some values “nan” here?

Can you attach the Nsight Compute report to this issue? 11.7 is a very old version of Nsight Compute. Are you able to try the newest version?

As suggested, I rebuilt the code with CUDA 12.0.x and profiled it with the 12.0.x Nsight Compute. But this version also produces “nan” values in the report output:

[14549] pw.x@127.0.0.1
  void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x64_16x4_tn_align1>(T1::Params) (19, 1, 11)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------ ------------
    Metric Name              Metric Unit Metric Value
    ----------------------- ------------ ------------
    DRAM Frequency          cycle/second      (!) nan
    SM Frequency            cycle/second      (!) nan
    Elapsed Cycles                 cycle      (!) nan
    Memory Throughput                  %      (!) nan
    DRAM Throughput                    %      (!) nan
    Duration                     msecond        10.62
    L1/TEX Cache Throughput            %      (!) nan
    L2 Cache Throughput                %      (!) nan
    SM Active Cycles               cycle      (!) nan
    Compute (SM) Throughput            %      (!) nan
    ----------------------- ------------ ------------

Please suggest how to get a report with accurate values, i.e. one that doesn’t contain “nan” values.

Thanks

The ncu report is attached. Note that it was profiled with MPI rank 0 only.
report.ncu.tpall.out.NOAD0033.14542.ncu-rep.zip (67.6 KB)

Are there any warnings or errors printed to the console when you run this profile? This does seem strange. Do you have any small kernels, like the CUDA samples, where you could try a simple Nsight Compute profile to see if anything will give good data?

There were no errors/warnings thrown while profiling with Nsight Compute.
I tried to compile the CUDA samples from the link you provided, GitHub - NVIDIA/cuda-samples: Samples for CUDA Developers which demonstrates features in CUDA Toolkit, but they don’t compile with CUDA 12.0 under any of the compilers I tried: gcc-7.5, gcc-12.x, and the NVIDIA C++ compiler. Every time the build failed with the same error:

/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/compilers/bin/nvcc -ccbin nvc++ -I../../../Common  -m64    --threads 0 --std=c++11 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -o matrixMultiplyPerf.o -c matrixMultiplyPerf.cu
/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/compilers/bin/nvcc -ccbin nvc++   -m64      -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -o UnifiedMemoryPerf commonKernels.o helperFunctions.o matrixMultiplyPerf.o
mkdir -p ../../../bin/x86_64/linux/release
cp UnifiedMemoryPerf ../../../bin/x86_64/linux/release
make[1]: Leaving directory '/lus/cflus02/usern1/src/cusamples/cuda-samples/Samples/6_Performance/UnifiedMemoryPerf'
make[1]: Entering directory '/lus/cflus02/usern1/src/cusamples/cuda-samples/Samples/6_Performance/LargeKernelParameter'
/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/compilers/bin/nvcc -ccbin nvc++ -I../../../Common  -m64    --std=c++11 --threads 0 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -o LargeKernelParameter.o -c LargeKernelParameter.cu
LargeKernelParameter.cu(58): Error: Formal parameter space overflowed (4104 bytes required, max 4096 bytes allowed) in function _Z13kernelDefault7param_tPi
LargeKernelParameter.cu(75): Error: Formal parameter space overflowed (32008 bytes required, max 4096 bytes allowed) in function _Z16kernelLargeParam13param_large_tPi

Thanks

Can you try compiling one of the simple samples, like the 0_Introduction/matrixMul sample, and then collecting a simple Nsight Compute report? I’m not familiar with your specific sample and what’s going on there. Also, did you update the driver when moving to CUDA 12.0? Can you share your driver version via the “nvidia-smi” command?
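For reference, a minimal sketch of those steps, assuming the standard cuda-samples directory layout:

cd cuda-samples/Samples/0_Introduction/matrixMul
make
ncu -o matrixMul_report -f ./matrixMul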

The 0_Introduction/matrixMul sample compiles well and also generates a profiler report without any “nan” values.

[83117] matrixMul@127.0.0.1
  void MatrixMulCUDA<(int)32>(float *, float *, float *, int, int) (20, 10, 1)x(32, 32, 1), Context 1, Stream 13, Device 0, CC 8.0
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         1.54
    SM Frequency            cycle/nsecond         1.22
    Elapsed Cycles                  cycle       44,524
    Memory Throughput                   %        70.29
    DRAM Throughput                     %         1.72
    Duration                      usecond        36.32
    L1/TEX Cache Throughput             %        79.18
    L2 Cache Throughput                 %         7.72
    SM Active Cycles                cycle    39,427.31
    Compute (SM) Throughput             %        59.11
    ----------------------- ------------- ------------

This matrixMul from CUDA Samples 0_Introduction doesn’t use MPI; it’s CUDA only. But the program that I’m working on, whose ncu profiler report has “nan” values, is an MPI + CUDA code, and it runs on multiple GPUs and multiple nodes.

Here is the driver version:

> nvidia-smi
Wed Oct 18 13:16:44 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+

Can you share some more details about how MPI is used? What implementation (MPICH, etc.) and version? Can you share the command used to launch the app with ncu? There are different variants described in Nsight Compute CLI :: Nsight Compute Documentation. It may also be worth trying to profile a rank other than 0, in case something is going on with rank 0.
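For instance, a sketch of your earlier wrapper changed to profile rank 1 instead of rank 0:

if [[ ${SLURM_PROCID} -eq 1 ]]; then
        ncu --target-processes all -o report.ncu.out.%h.%p -f "$@"
else
        "$@"
fi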

Please find the details here:
The MPI used (with the CUDA-aware MPI option enabled) is:

Cray-MPICH-8.1.27

The job is run on 4 nodes, 12 ranks per node, 4 GPUs per node (i.e. 3 ranks per GPU).

Command Used to launch app:

srun --nodes=4 --ntasks-per-node=12 --export=ALL --mpi=pmi2 --cpu-bind=map_cpu:56,52,48,40,36,32,24,20,16,8,4,0 ./rank0_ncu.sh pw.x -in pw.in

The rank0_ncu.sh is:

if [[ ${SLURM_PROCID} -eq 0 ]]; then
        ncu --target-processes all -o report.ncu.tpall.out.%h.%p -f -c 100 "$@"
else
        "$@"
fi

Before doing the above steps, I followed the steps described in Nsight Compute CLI :: Nsight Compute Documentation.