Call stack is visible/captured only for some CUDA kernels (broken backtraces)

Hello,

I am currently trying out Nsight Systems, and I have an issue where I cannot get the call stack for most of the CUDA kernels, while some others show fine:

But for others, for example:

[screenshot]

Why is the call stack showing only for some kernels? How can I get the call stack for all kernels? For context, I am profiling PyTorch code, PyTorch being installed from pip.

I tried setting a very low threshold with --cudabacktrace=all:100, as well as --osrt-threshold=200 --osrt-backtrace-threshold=5000.

I tried using --sample=system-wide, but then the trace is HUGE. I also tried --backtrace=lbr, --backtrace=fp, and --backtrace=dwarf with no success (one of them just gave a "broken trace").

Full command I am using currently: nsys profile --gpu-metrics-device=0 -w true -t cuda,nvtx,osrt,cudnn,cublas --sample=process-tree --capture-range=cudaProfilerApi --cudabacktrace=all:50 --osrt-threshold=200 --osrt-backtrace-threshold=5000 -x true -f true -o my_profile python my_script.py


nsys status --environment gives:

Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 0
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.0-56-generic: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK

Versions:

  • nsys: 2022.4.2.18-32044700v0
  • CUDA: 11.7
  • Nvidia Driver: 515.86.01

Your help is very much appreciated!

Edit: I suspect I need to build PyTorch from source with -fno-omit-frame-pointer for this to work. Will try it out.

I have these warnings:

Could this be related? Using Nsight Systems to profile GPU workload - #5 by pyotr777 - NVIDIA CUDA - PyTorch Dev Discussions

Compiling PyTorch with:

CMAKE_BUILD_TYPE=RelWithDebInfo CFLAGS="-funwind-tables -fno-omit-frame-pointer" BUILD_TEST=0 BUILD_CAFFE2=0 BUILD_CAFFE2_OPS=0 USE_DISTRIBUTED=0 USE_ROCM=0 USE_FBGEMM=0 USE_QNNPACK=0 python setup.py install

did not help. For what it’s worth, during the install cuDNN is found (but Nsight still shows nothing about cuDNN):

-- Found CUDNN: /usr/local/cuda/lib64/libcudnn.so  
-- Found cuDNN: v8.4.1  (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn.so)

torch.__config__.show() gives me:

PyTorch built with:
  - GCC 11.3
  - C++ Version: 201402
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.7
  - NVCC architecture flags: -gencode;arch=compute_86,code=sm_86
  - CuDNN 8.4.1  (built against CUDA 11.6)
  - Magma 2.5.2
  - Build settings: BUILD_TYPE=RelWithDebInfo, CUDA_VERSION=11.7, CUDNN_VERSION=8.4.1, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS=-funwind-tables -fno-omit-frame-pointer -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=0, 

nsys run now with:

nsys profile --gpu-metrics-device=0 -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi --backtrace=fp --cudabacktrace=all:500 --osrt-threshold=500 --osrt-backtrace-threshold=10000 -x true -f true -o my_profile python my_script.py

and gives:

When not specifying --backtrace=fp:

The following CUDA flags are used during compile (tried as well to pass -g -funwind-tables -fno-omit-frame-pointer to cuda flags, with no more success):

--     CUDA flags          :  -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_86,code=sm_86 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__

The following CXX flags are used during compile:

--   CXX flags             : -g -funwind-tables -fno-omit-frame-pointer -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow

nvcc V11.7.99 and gcc 11.3.0 are used.

Trying NVIDIA Nsight Systems 2022.5.1.82-32078057v0 and cuDNN 8.7 did not help either.

Are you trying to profile per-process (process tree) or system-wide?

There are three methods that Nsight Systems will use to resolve symbols: we will try LBR on Intel CPUs, we will try frame pointers (if frame pointers have not been omitted at compile time), or we will try DWARF unwind.

(from the documentation)

Three different backtrace collections options are available when sampling CPU instruction pointers. Backtraces can be generated using Intel (c) Last Branch Record (LBR) registers. LBR backtraces generate minimal overhead but the backtraces have limited depth. Backtraces can also be generated using DWARF debug data. DWARF backtraces incur more overhead than LBR backtraces but have much better depth. Finally, backtraces can be generated using frame pointers. Frame pointer backtraces incur medium overhead and have good depth but only resolve frames in the portions of the application and its libraries (including 3rd party libraries) that were compiled with frame pointers enabled. Normally, frame pointers are disabled by default during compilation.

By default, Nsight Systems will use Intel(c) LBRs if available and fall back to using dwarf unwind if they are not. Choose modes… will allow you to override the default.

[screenshot: Choose backtrace option]

The Include child processes switch controls whether API tracing is only for the launched process, or for all existing and new child processes of the launched process. If you are running your application through a script, for example a bash script, you need to set this checkbox.

(end inclusion).

From the CLI, the option is:

-s, --sample=<process-tree|system-wide|none>

Select how to collect CPU IP/backtrace samples. If ‘none’ is selected, CPU sampling is disabled. Depending on the platform, some values may require admin or root privileges. If a target application is launched, the default is ‘process-tree’, otherwise, the default is ‘none’. Note: ‘system-wide’ is not available on all platforms. Note: If set to ‘none’, CPU context switch data will still be collected unless the --cpuctxsw switch is set to ‘none’.

Thank you, I mostly tried --sample=process-tree with no success. I tried --sample=system-wide as well, but it resulted in a > 2GB log so it was not usable.

You are just looking for actual CUDA kernel backtraces, yes?

Do you need the other information?

If not, you can use

nsys profile --sample=none --capture-range=cudaProfilerApi --cudabacktrace=all:500 -f true -o my_profile python my_script.py

this will give you CUDA and NVTX information only, start and stop based on the cudaProfilerApi in your code, and collect CUDA backtraces.

It will skip CPU information (including function backtraces) which will prevent the OS throttling warning you are getting. This and not tracing OSRT will cut back on the size of the trace.
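For reference, --capture-range=cudaProfilerApi expects the target application to call cudaProfilerStart/cudaProfilerStop itself; in PyTorch these are exposed as torch.cuda.profiler.start()/stop(). A minimal sketch with a hypothetical workload (substitute your own model and loop):

```python
import torch

def main():
    # Hypothetical workload; replace with your real model / training step.
    model = torch.nn.Linear(128, 128).cuda()
    x = torch.randn(32, 128, device="cuda")

    for step in range(20):
        if step == 5:
            torch.cuda.profiler.start()   # cudaProfilerStart: nsys begins capture
        model(x)
        if step == 15:
            torch.cuda.profiler.stop()    # cudaProfilerStop: nsys ends capture
    torch.cuda.synchronize()

if torch.cuda.is_available():
    main()
```

Run under the nsys command above, only the window between start() and stop() appears in the timeline, which also keeps the report small.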

Note that we do not trace ALL CUDA kernels; we skip very short kernels, see User Guide :: Nsight Systems Documentation (it’s actually an exact link, but the forum software makes it look like a general one).