Compiling PyTorch with:
CMAKE_BUILD_TYPE=RelWithDebInfo CFLAGS="-funwind-tables -fno-omit-frame-pointer" BUILD_TEST=0 BUILD_CAFFE2=0 BUILD_CAFFE2_OPS=0 USE_DISTRIBUTED=0 USE_ROCM=0 USE_FBGEMM=0 USE_QNNPACK=0 python setup.py install
did not help. For what it’s worth, cuDNN is found during the install (but Nsight still shows nothing about cuDNN):
-- Found CUDNN: /usr/local/cuda/lib64/libcudnn.so
-- Found cuDNN: v8.4.1 (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn.so)
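The CMake lines above only show which libcudnn.so was found at build time. As a rough cross-check, the following sketch lists the libcudnn files actually mapped into a running process by reading /proc/self/maps. It assumes Linux, and the helper names (mapped_libraries, cudnn_libraries_in_this_process) are made up for illustration and are not part of PyTorch or Nsight.

```python
import re


def mapped_libraries(maps_text, pattern=r"libcudnn"):
    """Return the sorted, unique file paths in /proc/<pid>/maps text that match pattern."""
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address, perms, offset, dev, inode, pathname (pathname may be absent).
        fields = line.split(None, 5)
        if len(fields) == 6 and re.search(pattern, fields[5]):
            paths.add(fields[5].strip())
    return sorted(paths)


def cudnn_libraries_in_this_process():
    """Linux only: inspect the memory map of the current Python process."""
    with open("/proc/self/maps") as f:
        return mapped_libraries(f.read())
```

Calling cudnn_libraries_in_this_process() after `import torch` (and ideally after running one convolution on the GPU) should show whether a shared libcudnn is loaded and from which path. An empty list would suggest that no shared cuDNN library is mapped, but that is only a hint, not a diagnosis.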
torch.__config__.show()
gives me:
PyTorch built with:
- GCC 11.3
- C++ Version: 201402
- Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.7
- NVCC architecture flags: -gencode;arch=compute_86,code=sm_86
- CuDNN 8.4.1 (built against CUDA 11.6)
- Magma 2.5.2
- Build settings: BUILD_TYPE=RelWithDebInfo, CUDA_VERSION=11.7, CUDNN_VERSION=8.4.1, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS=-funwind-tables -fno-omit-frame-pointer -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=0,
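The "Build settings" line above is hard to scan by eye. A minimal sketch for turning it into a dictionary, so that values such as USE_CUDNN or CXX_FLAGS can be checked directly, could look like this. The function name parse_build_settings is hypothetical, and the parsing assumes the "KEY=VALUE, KEY=VALUE" layout shown in the output above.

```python
def parse_build_settings(config_text):
    """Extract KEY=VALUE pairs from the 'Build settings:' line of the config dump."""
    marker = "Build settings:"
    for line in config_text.splitlines():
        if marker in line:
            body = line.split(marker, 1)[1]
            settings = {}
            # Entries are separated by ", "; compiler flags inside a value are
            # space-separated, so they stay together in one entry.
            for item in body.split(", "):
                key, sep, value = item.strip().rstrip(",").partition("=")
                if sep:
                    settings[key] = value
            return settings
    return {}
```

Usage, assuming PyTorch is importable: `settings = parse_build_settings(torch.__config__.show())`, then inspect `settings["USE_CUDNN"]`, `settings["CUDNN_VERSION"]` or `settings["CXX_FLAGS"]`.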
nsys is now run with:
nsys profile --gpu-metrics-device=0 -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi --backtrace=fp --cudabacktrace=all:500 --osrt-threshold=500 --osrt-backtrace-threshold=10000 -x true -f true -o my_profile python my_script.py
and gives:
When not specifying --backtrace=fp:
The following CUDA flags are used during compilation (I also tried passing -g -funwind-tables -fno-omit-frame-pointer to the CUDA flags, without any more success):
-- CUDA flags : -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_86,code=sm_86 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__
The following CXX flags are used during compilation:
-- CXX flags : -g -funwind-tables -fno-omit-frame-pointer -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow
nvcc V11.7.99 and gcc 11.3.0 are used.
Trying NVIDIA Nsight Systems 2022.5.1.82-32078057v0 and cuDNN 8.7 did not help either.
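For context on the nsys command above: with --capture-range=cudaProfilerApi, collection only covers the region between cudaProfilerStart and cudaProfilerStop, so my_script.py has to mark that region itself. The sketch below shows one way this might be done using torch.cuda.profiler and torch.cuda.nvtx. The helper capture_range, the function profiled_conv_forward, and the convolution shapes are illustrative assumptions, not the contents of the actual script.

```python
from contextlib import contextmanager


@contextmanager
def capture_range(profiler, nvtx=None, label="region"):
    """Bracket a region with profiler.start()/stop(), plus an optional NVTX range."""
    profiler.start()
    if nvtx is not None:
        nvtx.range_push(label)
    try:
        yield
    finally:
        if nvtx is not None:
            nvtx.range_pop()
        profiler.stop()


def profiled_conv_forward():
    # Imported lazily so capture_range itself does not require PyTorch.
    import torch

    conv = torch.nn.Conv2d(3, 16, kernel_size=3).cuda()
    x = torch.randn(8, 3, 64, 64, device="cuda")
    conv(x)  # warm-up outside the captured range
    torch.cuda.synchronize()

    # torch.cuda.profiler.start()/stop() wrap cudaProfilerStart/cudaProfilerStop.
    with capture_range(torch.cuda.profiler, torch.cuda.nvtx, "conv_forward"):
        conv(x)
        torch.cuda.synchronize()  # make sure the kernels finish inside the range
```

Calling profiled_conv_forward() from the profiled script gives a small, known cuDNN-backed workload inside the capture range, which makes it easier to tell whether the missing cuDNN rows come from the trace setup or from the workload itself.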