Call stack is visible/captured only for some CUDA kernels (broken backtraces)

Hello,

I am currently trying out Nsight Systems, and I have an issue where I cannot get the call stack for most of the CUDA kernels, while some others show fine:

But for others, for example:

[screenshot]

Why is the call stack showing only for some kernels? How can I get the call stack for all kernels? For context, I am profiling PyTorch code, PyTorch being installed from pip.

I tried setting a very low threshold with --cudabacktrace=all:100, as well as --osrt-threshold=200 --osrt-backtrace-threshold=5000.

I tried using --sample=system-wide, but then the trace is HUGE. I also tried --backtrace=lbr, --backtrace=fp, and --backtrace=dwarf with no success (one of them just gave a "broken trace").

Full command I am using currently: nsys profile --gpu-metrics-device=0 -w true -t cuda,nvtx,osrt,cudnn,cublas --sample=process-tree --capture-range=cudaProfilerApi --cudabacktrace=all:50 --osrt-threshold=200 --osrt-backtrace-threshold=5000 -x true -f true -o my_profile python my_script.py


nsys status --environment gives:

Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 0
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.0-56-generic: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK

Versions:

  • nsys: 2022.4.2.18-32044700v0
  • CUDA: 11.7
  • Nvidia Driver: 515.86.01

Your help is very much appreciated!

Edit: I suspect I need to build PyTorch from source with -fno-omit-frame-pointer for this to work. Will try it out.

I have these warnings:

Could this be related? Using Nsight Systems to profile GPU workload - #5 by pyotr777 - NVIDIA CUDA - PyTorch Dev Discussions

Compiling PyTorch with:

CMAKE_BUILD_TYPE=RelWithDebInfo CFLAGS="-funwind-tables -fno-omit-frame-pointer" BUILD_TEST=0 BUILD_CAFFE2=0 BUILD_CAFFE2_OPS=0 USE_DISTRIBUTED=0 USE_ROCM=0 USE_FBGEMM=0 USE_QNNPACK=0 python setup.py install

did not help. For what it’s worth, during the install cuDNN is found (but Nsight still shows nothing about cuDNN):

-- Found CUDNN: /usr/local/cuda/lib64/libcudnn.so  
-- Found cuDNN: v8.4.1  (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn.so)

torch.__config__.show() gives me:

PyTorch built with:
  - GCC 11.3
  - C++ Version: 201402
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.7
  - NVCC architecture flags: -gencode;arch=compute_86,code=sm_86
  - CuDNN 8.4.1  (built against CUDA 11.6)
  - Magma 2.5.2
  - Build settings: BUILD_TYPE=RelWithDebInfo, CUDA_VERSION=11.7, CUDNN_VERSION=8.4.1, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS=-funwind-tables -fno-omit-frame-pointer -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=0, 

nsys run now with:

nsys profile --gpu-metrics-device=0 -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi --backtrace=fp --cudabacktrace=all:500 --osrt-threshold=500 --osrt-backtrace-threshold=10000 -x true -f true -o my_profile python my_script.py

and gives:

When not specifying --backtrace=fp:

The following CUDA flags are used during compile (tried as well to pass -g -funwind-tables -fno-omit-frame-pointer to cuda flags, with no more success):

--     CUDA flags          :  -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_86,code=sm_86 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__

The following CXX flags are used during compile:

--   CXX flags             : -g -funwind-tables -fno-omit-frame-pointer -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow

nvcc V11.7.99 and gcc 11.3.0 are used.

Trying NVIDIA Nsight Systems 2022.5.1.82-32078057v0 and cuDNN 8.7 did not help either.

Are you trying to profile per-process (process tree) or system-wide?

There are three methods that Nsight Systems will use to resolve symbols: we will try LBR on Intel CPUs, we will try frame pointers (if frame pointers have not been omitted at compile time), or we will try DWARF unwind.

(from the documentation)

Three different backtrace collections options are available when sampling CPU instruction pointers. Backtraces can be generated using Intel (c) Last Branch Record (LBR) registers. LBR backtraces generate minimal overhead but the backtraces have limited depth. Backtraces can also be generated using DWARF debug data. DWARF backtraces incur more overhead than LBR backtraces but have much better depth. Finally, backtraces can be generated using frame pointers. Frame pointer backtraces incur medium overhead and have good depth but only resolve frames in the portions of the application and its libraries (including 3rd party libraries) that were compiled with frame pointers enabled. Normally, frame pointers are disabled by default during compilation.

By default, Nsight Systems will use Intel(c) LBRs if available and fall back to using dwarf unwind if they are not. Choose modes… will allow you to override the default.

[screenshot: Choose backtrace option]

The Include child processes switch controls whether API tracing is only for the launched process, or for all existing and new child processes of the launched process. If you are running your application through a script, for example a bash script, you need to set this checkbox.

(end inclusion).

From the CLI, the option is:

-s, --sample=<process-tree|system-wide|none>

Select how to collect CPU IP/backtrace samples. If ‘none’ is selected, CPU sampling is disabled. Depending on the platform, some values may require admin or root privileges. If a target application is launched, the default is ‘process-tree’, otherwise, the default is ‘none’. Note: ‘system-wide’ is not available on all platforms. Note: If set to ‘none’, CPU context switch data will still be collected unless the --cpuctxsw switch is set to ‘none’.

Thank you, I mostly tried --sample=process-tree with no success. I tried --sample=system-wide as well, but it resulted in a > 2GB log so it was not usable.

You are just looking for actual CUDA kernel backtraces, yes?

Do you need the other information?

If not, you can use

nsys profile --sample=none --capture-range=cudaProfilerApi --cudabacktrace=all:500 -f true -o my_profile python my_script.py

this will give you CUDA and NVTX information only, start and stop based on the cudaProfilerApi in your code, and collect CUDA backtraces.

It will skip CPU information (including function backtraces) which will prevent the OS throttling warning you are getting. This and not tracing OSRT will cut back on the size of the trace.
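For reference, --capture-range=cudaProfilerApi expects the target application to call cudaProfilerStart/cudaProfilerStop itself; in PyTorch these are exposed as torch.cuda.profiler.start()/stop(). A minimal sketch with a hypothetical workload (substitute your own model and loop):

```python
import torch

def main():
    # Hypothetical workload; replace with your real model / training step.
    model = torch.nn.Linear(128, 128).cuda()
    x = torch.randn(32, 128, device="cuda")

    for step in range(20):
        if step == 5:
            torch.cuda.profiler.start()   # cudaProfilerStart: nsys begins capture
        model(x)
        if step == 15:
            torch.cuda.profiler.stop()    # cudaProfilerStop: nsys ends capture
    torch.cuda.synchronize()

if torch.cuda.is_available():
    main()
```

Run under the nsys command above, only the window between start() and stop() appears in the timeline, which also keeps the report small.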

Note that we do not trace ALL CUDA kernels; we skip very short kernels, see User Guide :: Nsight Systems Documentation (it’s actually an exact link, but the forum software makes it look like a general one).