Hello,
I’m using the Nsight Compute version that ships with CUDA 11.7 to profile a Fortran application. The job normally runs in 3 minutes, but under the Nsight Compute profiler it does not finish even after 4 hours. So I’m looking for options to reduce the profiling time:
(1) Profile kernels from a single process, i.e. the process with MPI rank 0.
(2) Profile only application-specific kernels, i.e. skip kernels such as “__pgi_dev_cumemset_4n”.
Are there any such options available with Nsight Compute?
You can find further information on these topics here and here in the documentation.
The Nsight Systems (nsys) profiler output shows that the following kernel consumes most of the runtime:
[7/9] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------- ------------- ----------- ----------- ------------ ----------------------------------------------------------------------------------------------------
17.2 35,060,287,713 392 89,439,509.5 73,738,924.0 36,020,468 274,459,150 31,305,026.5 void cutlass::Kernel<cutlass_80_tensorop_d884gemm_128x64_16x3_nn_align1>(T1::Params)
But the same kernel name cannot be profiled under Nsight Compute (ncu):
export s_ncuprof_cmd=" ncu -o report.ncu.out --target-processes all -f -k cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x64_16x4_tn_align1>(T1::Params) "
It says:
FORTRAN STOP
==PROF== Disconnected from process 79447
==WARNING== No kernels were profiled.
Please suggest how to profile a kernel, as listed in the nsys output, with the ncu profiler.
Thanks
You have to specify the --kernel-name-base option to tell ncu which version of the kernel/function name it should consider for matching the filter. In your case, it would be --kernel-name-base demangled. Note that you can furthermore use regular expressions with the regex: prefix (as clarified in the docs) with -k.
Now, with these ncu options:
ncu --target-processes all -o report.ncu.tpall.out.%h.%p -f --kernel-name regex:cutlass_80_tensorop --kernel-name-base demangled
it fails with the following error:
==ERROR== UnknownError
==ERROR== Failed to profile kernel "Kernel" in process 61987
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
srun: error: ak0038: task 20: Exited with exit code 9
srun: error: ak0040: task 39: Exited with exit code 9
==PROF== Profiling "void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x64_16x4_tn_align1>(T1::Params)": 0%
It’s an MPI code, and srun (Slurm) is used to launch the job on the nodes. However, it works if only the rank 0 process is profiled:
if [[ ${SLURM_PROCID} -eq 0 ]]; then
    ncu --target-processes=all -o report.ncu.tpall.out.%h.%p -f --kernel-name regex:cutlass_80_tensorop --kernel-name-base demangled "$@"
else
    # all other ranks run the application unprofiled
    "$@"
fi
Is there any option that needs to be used for multi-process profiling?
In general, we recommend profiling a single rank per node for MPI applications. By default, Nsight Compute serializes kernels within a process, replays them, and needs a global lock on the profiling hardware. Because of this, multi-process applications that communicate between processes, such as MPI applications, can often hang or fail when multiple ranks are profiled in parallel.
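As a rough sketch of the one-rank-per-node recommendation, a small wrapper script could decide per task whether to start the application under ncu. The script name ncu_wrap.sh, the report file pattern, and the use of Slurm's SLURM_LOCALID (the node-local task index) to pick the rank are assumptions for illustration, not something confirmed in this thread; the kernel filter is the one from the command above.

```shell
#!/bin/sh
# ncu_wrap.sh (hypothetical name): profile one rank per node, run the rest normally.
# Assumes Slurm exports SLURM_LOCALID, the node-local task index, to every task.
# If the variable is unset (not running under Slurm), the task is profiled.

# Succeeds only for the first task on each node.
should_profile() {
  [ "${SLURM_LOCALID:-0}" -eq 0 ]
}

# Run the given command line, wrapped in ncu only on the selected ranks.
run_rank() {
  if should_profile; then
    # %h and %p expand to the host name and process ID in the report name.
    ncu --target-processes all -f \
        -o "report.ncu.%h.%p" \
        --kernel-name regex:cutlass_80_tensorop \
        --kernel-name-base demangled \
        "$@"
  else
    # Every other rank runs the application unprofiled.
    "$@"
  fi
}

# Launch only when an application command line was passed to the script.
if [ "$#" -gt 0 ]; then
  run_rank "$@"
fi
```

It would be launched along the lines of `srun ./ncu_wrap.sh ./app`, where ./app is a placeholder for the real executable and its arguments.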
Ok Thanks.
Without the ncu profiler, the job’s run time is 3 minutes. But with ncu, even when profiling a single kernel [ncu --target-processes=all -o report.ncu.tpall.out.%h.%p -f --kernel-name regex:cutlass_80_tensorop --kernel-name-base demangled], it does not complete even in 4 hours. Please suggest which options to use to reduce the profiling time while still getting the general GPU-utilization characteristics of the code (I am not necessarily looking to improve performance). Thanks
It’s likely that you’re profiling many instances of the same kernel. One thing to try is profiling a subset using the --launch-count flag. You can optionally add the --launch-skip and --launch-skip-before-match flags. See the Nsight Compute CLI section of the Nsight Compute documentation for more details.
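For illustration, the nsys summary earlier in the thread lists 392 instances of a single CUTLASS GEMM kernel, so limiting the number of profiled launches should cut the run time considerably. The sketch below adds --launch-skip and --launch-count to the command already used in this thread; the variable names NCU_LAUNCH_SKIP and NCU_LAUNCH_COUNT, the values 10 and 5, and the report name are made up for the example.

```shell
#!/bin/sh
# Sketch: profile only a handful of launches of the matching kernels.
# The defaults below are illustrative values, not recommendations.
NCU_LAUNCH_SKIP="${NCU_LAUNCH_SKIP:-10}"   # matching launches to skip first (e.g. warm-up)
NCU_LAUNCH_COUNT="${NCU_LAUNCH_COUNT:-5}"  # matching launches to collect after that

# Run the given application command line under ncu with the launch limits applied.
profile_subset() {
  ncu --target-processes all -f \
      -o "report.ncu.subset.%h.%p" \
      --kernel-name regex:cutlass_80_tensorop \
      --kernel-name-base demangled \
      --launch-skip "${NCU_LAUNCH_SKIP}" \
      --launch-count "${NCU_LAUNCH_COUNT}" \
      "$@"
}
```

With a kernel filter in place, both counters only count launches that match the filter; --launch-skip-before-match instead counts every launch regardless of the filter.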
Thanks.
Our objective in profiling this Fortran application is to get general characteristics of the code, such as:
- Time split between CPU and GPU.
- Summary of time spent in MPI and in data transfers between host and device (in both directions).
- Utilization (%) of each GPU.
- Memory utilization, etc.
To get these details, which tool do you suggest: (1) nsys or (2) ncu?
Nsight Systems is going to give you more of this type of information than Nsight Compute.
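A minimal sketch of how such a summary might be collected with nsys: trace CUDA, MPI, and NVTX activity per rank, then print text summaries from the resulting report. The function names, the report file name, and the choice of reports are assumptions for the example; report names differ between nsys versions, and `nsys stats --help-reports` lists the ones available in yours.

```shell
#!/bin/sh
# Sketch: one Nsight Systems report per MPI rank, then text summaries.

# Run the given application command line under nsys.
# SLURM_PROCID (the global task index set by Slurm) keeps report names unique per rank.
nsys_rank() {
  nsys profile --trace=cuda,mpi,nvtx \
       --force-overwrite true \
       -o "report.nsys.${SLURM_PROCID:-0}" \
       "$@"
}

# Print kernel-time, memory-transfer-time, and MPI summaries for one report file.
# $1: path to the report written by nsys_rank
nsys_summaries() {
  nsys stats --report cuda_gpu_kern_sum,cuda_gpu_mem_time_sum,mpi_event_sum "$1"
}
```

Used as `srun` launching a wrapper that calls `nsys_rank ./app` (./app being a placeholder), followed by `nsys_summaries` on one of the generated report files.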
The imported output from a Nsight Compute profiler report shows some of the values as “nan”:
void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x32_16x4_tn_align1>(T1::Params), 2023-Sep-06 01:39:44, Context 1, Stream 7
Section: GPU Speed Of Light Throughput
----------------------- ------------ ----------
DRAM Frequency          cycle/second    (!) nan
SM Frequency            cycle/second    (!) nan
Elapsed Cycles          cycle           (!) nan
Memory [%]              %               (!) nan
DRAM Throughput         %               (!) nan
Duration                usecond          345.06
L1/TEX Cache Throughput %               (!) nan
L2 Cache Throughput     %               (!) nan
SM Active Cycles        cycle           (!) nan
Compute (SM) [%]        %               (!) nan
----------------------- ------------ ----------
A single kernel, “cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x32_16x4_tn_align1>(T1::Params)”, is profiled using only one process (MPI rank 0). The job uses 12 ranks/node and 4 GPUs/node, on 4 nodes in total.
Why are some values “nan” here?
Can you attach the Nsight Compute report to this issue? 11.7 is a very old version of Nsight Compute. Are you able to try the newest version?