Option to profile only master process

Hello,

I’m working with the Nsight Compute version from CUDA 11.7 to profile a Fortran application. The job’s runtime is 3 minutes, but under the Nsight Compute profiler it doesn’t finish even in 4 hours. So I’m looking for options to reduce the profiling time:
(1) profile kernels from a single process only, i.e. the process with MPI rank 0;
(2) profile only application-specific kernels, i.e. skip kernels such as “__pgi_dev_cumemset_4n”.

Are there any such options available with Nsight Compute?

You can find further information on both of these topics in the documentation.

The Nsight Systems (nsys) profiler output shows that the following kernel consumes most of the runtime:

[7/9] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)                                                   Name    
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  ------------  ----------------------------------------------------------------------------------------------------
     17.2   35,060,287,713        392   89,439,509.5   73,738,924.0   36,020,468  274,459,150  31,305,026.5  void cutlass::Kernel<cutlass_80_tensorop_d884gemm_128x64_16x3_nn_align1>(T1::Params)

But the same kernel name cannot be profiled with Nsight Compute (ncu):

export s_ncuprof_cmd=" ncu -o report.ncu.out --target-processes all -f -k cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x64_16x4_tn_align1>(T1::Params) "
It says:

FORTRAN STOP
==PROF== Disconnected from process 79447
==WARNING== No kernels were profiled.

Please suggest how to profile a kernel listed in the nsys output with the ncu profiler.

Thanks

You have to specify the --kernel-name-base option to tell ncu which version of the kernel/function name it should consider when matching the filter. In your case, it would be --kernel-name-base demangled. Note that you can also use regular expressions with -k via the regex: prefix (as described in the docs).
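For example, a minimal sketch using the demangled name from the nsys output above (the application name and arguments are placeholders):

ncu -o report.ncu.out -f --target-processes all --kernel-name-base demangled -k regex:cutlass_80_tensorop ./app <args>

Here regex:cutlass_80_tensorop matches any demangled kernel name containing that substring, so you don’t have to spell out the full template signature.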

Now with these NCU options

ncu --target-processes all -o report.ncu.tpall.out.%h.%p -f --kernel-name regex:cutlass_80_tensorop --kernel-name-base demangled

It fails with the following error:

==ERROR== UnknownError
==ERROR== Failed to profile kernel "Kernel" in process 61987
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
srun: error: ak0038: task 20: Exited with exit code 9
srun: error: ak0040: task 39: Exited with exit code 9
==PROF== Profiling "void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x64_16x4_tn_align1>(T1::Params)": 0%

It’s an MPI code, and srun (Slurm) is used to run the job on the nodes. However, it works if only the rank 0 process is profiled:

if [[ ${SLURM_PROCID} -eq 0 ]]; then
        ncu --target-processes=all -o report.ncu.tpall.out.%h.%p -f --kernel-name regex:cutlass_80_tensorop --kernel-name-base demangled "$@"
else
        "$@"
fi

Is there any option that needs to be used for multi-process profiling?

In general, we recommend profiling a single rank per node for MPI applications. Nsight Compute, by default, serializes kernels within a process, replays them, and needs a global lock on the profiling hardware. Because of this, multi-process applications with inter-rank communication, like MPI applications, can often hang or fail when trying to profile multiple ranks in parallel.
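If you still want coverage of every node, one common pattern is a wrapper script that profiles only the first rank on each node. A minimal sketch, assuming a Slurm launcher (SLURM_LOCALID is the node-local task ID):

#!/bin/bash
# Profile only the first task on each node; every other rank runs the app unmodified.
if [[ ${SLURM_LOCALID} -eq 0 ]]; then
        ncu -o report.%h.%p -f "$@"
else
        "$@"
fi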

Ok Thanks.

Without the NCU profiler, the job’s runtime is 3 minutes. But with NCU and single-kernel profiling [ncu --target-processes=all -o report.ncu.tpall.out.%h.%p -f --kernel-name regex:cutlass_80_tensorop --kernel-name-base demangled], it doesn’t complete even in 4 hours. Please suggest which options to use to reduce the profiling time while still getting general characteristics of the code’s GPU utilization [I’m not necessarily looking to improve performance]. Thanks

It’s likely that you’re profiling many instances of the same kernel. One thing to try is profiling only a subset of the launches using the --launch-count flag. You can optionally add the --launch-skip and --launch-skip-before-match flags. See Nsight Compute CLI :: Nsight Compute Documentation for more details.
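For example, building on your earlier command (a sketch; the skip/count values are placeholders to tune):

ncu --target-processes all -o report.ncu.out -f --kernel-name regex:cutlass_80_tensorop --kernel-name-base demangled --launch-skip 10 --launch-count 5 ./app

--launch-skip skips the given number of kernel launches before profiling starts, and --launch-count stops profiling once that many kernels have been profiled, so not all 392 instances from your nsys output get replayed.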

Thanks.

Our objective in profiling this Fortran application is to get general characteristics of the code, such as:

  • Time split between CPU and GPU.
  • Summary of time spent in MPI and in data transfers between host and device (and vice versa).
  • Utilization of each GPU (%).
  • Memory utilization, etc.

To get these details, which tool do you suggest: (1) nsys or (2) ncu?

Nsight Systems is going to give you more of this type of information than Nsight Compute.
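For example, a sketch of an nsys invocation covering those categories (adjust the trace domains to what your build supports):

srun <srun-args> nsys profile --trace=cuda,mpi,nvtx --stats=true -o report.%q{SLURM_PROCID} ./pw.x -in pw.in

The cuda trace captures kernel and host/device memcpy times, the mpi trace summarizes time spent in MPI calls, and --stats=true prints summary tables like the cuda_gpu_kern_sum report you quoted earlier.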

The imported output from an Nsight Compute profiler report shows some of the values as “nan”:

 void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x32_16x4_tn_align1>(T1::Params), 2023-Sep-06 01:39:44, Context 1, Stream 7
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                            cycle/second                        (!) nan
    SM Frequency                                                              cycle/second                        (!) nan
    Elapsed Cycles                                                                   cycle                        (!) nan
    Memory [%]                                                                           %                        (!) nan
    DRAM Throughput                                                                      %                        (!) nan
    Duration                                                                       usecond                         345.06
    L1/TEX Cache Throughput                                                              %                        (!) nan
    L2 Cache Throughput                                                                  %                        (!) nan
    SM Active Cycles                                                                 cycle                        (!) nan
    Compute (SM) [%]                                                                     %                        (!) nan
    ---------------------------------------------------------------------- --------------- ------------------------------

A single kernel, “cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x32_16x4_tn_align1>(T1::Params)”, is profiled using only one process [the MPI rank 0 process]. The job runs 12 ranks/node and 4 GPUs/node on 4 nodes in total.

Why are some values “nan” here?

Can you attach the Nsight Compute report to this issue? 11.7 is a very old version of Nsight Compute. Are you able to try the newest version?

As suggested, I rebuilt the code with CUDA 12.0.x and profiled it with the 12.0.x Nsight Compute. But this version also produces “nan” values in the report output:

[14549] pw.x@127.0.0.1
  void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x64_16x4_tn_align1>(T1::Params) (19, 1, 11)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------ ------------
    Metric Name              Metric Unit Metric Value
    ----------------------- ------------ ------------
    DRAM Frequency          cycle/second      (!) nan
    SM Frequency            cycle/second      (!) nan
    Elapsed Cycles                 cycle      (!) nan
    Memory Throughput                  %      (!) nan
    DRAM Throughput                    %      (!) nan
    Duration                     msecond        10.62
    L1/TEX Cache Throughput            %      (!) nan
    L2 Cache Throughput                %      (!) nan
    SM Active Cycles               cycle      (!) nan
    Compute (SM) Throughput            %      (!) nan
    ----------------------- ------------ ------------

Please suggest how to get a report with accurate values, i.e. one that doesn’t contain “nan” values.

Thanks

The ncu report is attached. Note that it was profiled with MPI rank 0 only.
report.ncu.tpall.out.NOAD0033.14542.ncu-rep.zip (67.6 KB)

Are there any warnings or errors printed to the console when you run this profile? This does seem strange. Do you have any small kernels, like the CUDA samples, where you could try a simple Nsight Compute profile to see if anything will give good data?

There were no errors/warnings thrown while profiling with Nsight Compute.
I tried to compile the CUDA samples from the link you provided, GitHub - NVIDIA/cuda-samples: Samples for CUDA Developers which demonstrates features in CUDA Toolkit, but they don’t compile with CUDA 12.0 under any of the compilers I tried: gcc-7.5, gcc-12.x, and the NVIDIA C++ compiler. Every time the build failed with the same error:

/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/compilers/bin/nvcc -ccbin nvc++ -I../../../Common  -m64    --threads 0 --std=c++11 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -o matrixMultiplyPerf.o -c matrixMultiplyPerf.cu
/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/compilers/bin/nvcc -ccbin nvc++   -m64      -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -o UnifiedMemoryPerf commonKernels.o helperFunctions.o matrixMultiplyPerf.o
mkdir -p ../../../bin/x86_64/linux/release
cp UnifiedMemoryPerf ../../../bin/x86_64/linux/release
make[1]: Leaving directory '/lus/cflus02/usern1/src/cusamples/cuda-samples/Samples/6_Performance/UnifiedMemoryPerf'
make[1]: Entering directory '/lus/cflus02/usern1/src/cusamples/cuda-samples/Samples/6_Performance/LargeKernelParameter'
/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/compilers/bin/nvcc -ccbin nvc++ -I../../../Common  -m64    --std=c++11 --threads 0 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -o LargeKernelParameter.o -c LargeKernelParameter.cu
LargeKernelParameter.cu(58): Error: Formal parameter space overflowed (4104 bytes required, max 4096 bytes allowed) in function _Z13kernelDefault7param_tPi
LargeKernelParameter.cu(75): Error: Formal parameter space overflowed (32008 bytes required, max 4096 bytes allowed) in function _Z16kernelLargeParam13param_large_tPi

Thanks

Can you try compiling one of the simple samples, like the 0_Introduction/matrixMul sample, and then collecting a simple Nsight Compute report? I’m not familiar with your specific sample and what’s going on there. Also, did you update the driver when moving to CUDA 12.0? Can you share your driver version via the “nvidia-smi” command?
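For reference, a minimal sketch of those steps, assuming the standard cuda-samples directory layout:

cd cuda-samples/Samples/0_Introduction/matrixMul
make
ncu -o matrixMul_report -f ./matrixMul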

The 0_Introduction/matrixMul sample compiles well and also generates a profiler report without any “nan” values.

[83117] matrixMul@127.0.0.1
  void MatrixMulCUDA<(int)32>(float *, float *, float *, int, int) (20, 10, 1)x(32, 32, 1), Context 1, Stream 13, Device 0, CC 8.0
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         1.54
    SM Frequency            cycle/nsecond         1.22
    Elapsed Cycles                  cycle       44,524
    Memory Throughput                   %        70.29
    DRAM Throughput                     %         1.72
    Duration                      usecond        36.32
    L1/TEX Cache Throughput             %        79.18
    L2 Cache Throughput                 %         7.72
    SM Active Cycles                cycle    39,427.31
    Compute (SM) Throughput             %        59.11
    ----------------------- ------------- ------------

This matrixMul from CUDA Samples 0_Introduction doesn’t use MPI; it’s CUDA only. But the program that I’m working on, whose ncu profiler report has “nan” values, is an MPI + CUDA code, and it runs on multiple GPUs and multiple nodes.

Here is the driver version:

> nvidia-smi
Wed Oct 18 13:16:44 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+

Can you share some more details about how MPI is used? What implementation (MPICH, etc.) and version? Can you share the command used to launch the app with ncu? There are different variants described in Nsight Compute CLI :: Nsight Compute Documentation. It may also be worth trying to profile a rank other than 0, in case something is going on with rank 0.
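For instance, a sketch of your earlier wrapper changed to profile rank 1 instead of rank 0:

if [[ ${SLURM_PROCID} -eq 1 ]]; then
        ncu --target-processes all -o report.ncu.out.%h.%p -f "$@"
else
        "$@"
fi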

Please find the details here:
The MPI used (with the CUDA-aware MPI option enabled) is:

Cray-MPICH-8.1.27

The job is run on 4 nodes, 12 ranks per node, 4 GPUs per node (i.e. 3 ranks per GPU).

Command Used to launch app:

srun --nodes=4 --ntasks-per-node=12 --export=ALL --mpi=pmi2 --cpu-bind=map_cpu:56,52,48,40,36,32,24,20,16,8,4,0 ./rank0_ncu.sh pw.x -in pw.in

The rank0_ncu.sh is:

if [[ ${SLURM_PROCID} -eq 0 ]]; then
        ncu --target-processes all -o report.ncu.tpall.out.%h.%p -f -c 100 "$@"
else
        "$@"
fi

Before doing the above steps, I followed the steps described in Nsight Compute CLI :: Nsight Compute Documentation.