Warning: Cuda API error detected: cuModuleLoadFatBinary returned (0xd1)

Hello, I have an RTX 6000 Ada 48GB GPU installed and I am trying to set up a local environment, but TensorFlow is giving me issues. I have installed CUDA 11.8, cuDNN 8.7, and TensorRT, but tensorflow-gpu still could not recognize my GPU. Can you please guide me on how to successfully set up a local environment? I originally installed TensorFlow 2.16, but since it did not work I downgraded to 2.15, which is able to detect my GPU. However, when I run some training, I get an out-of-memory error from the CUDA driver.

@kavindamadhujith Please start a new post, as your issue is more of a setup issue. Note that this forum is support for the developer tool cuda-gdb. Please post your problem under the "CUDA Setup and Installation" forum; you can get support there. Thanks!

Hi, @892516165

Sorry for the late response. We built your source code internally without problems.
We can also see "warning: Cuda API error detected: cuModuleLoadFatBinary returned (0xd1)" printed multiple times, but it does not impact debugging. For example:

warning: Cuda API error detected: cuModuleLoadFatBinary returned (0xd1)
warning: Cuda API error detected: cuModuleLoadFatBinary returned (0xd1)
warning: Cuda API error detected: cuModuleLoadFatBinary returned (0xd1)
warning: Cuda API error detected: cuModuleLoadFatBinary returned (0xd1)
warning: Cuda API error detected: cuModuleLoadFatBinary returned (0xd1)
warning: Cuda API error detected: cuModuleLoadFatBinary returned (0xd1)
warning: Cuda API error detected: cuModuleLoadFatBinary returned (0xd1)

[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

CUDA thread hit application kernel entry function breakpoint, 0x00007fffb5313600 in void cusparselt::pruning_kernel<64u, 8u, (cusparselt::(anonymous namespace)::Sparsity)1, true, __half>(__half const*, __half*, long, long, long, long)<<<(1,1,1),(64,8,1)>>> ()
(cuda-gdb)
(cuda-gdb)
(cuda-gdb)
(cuda-gdb) l
62  }
63
64  #define CHECK_CUSPARSE(func)                                           \
65  {                                                                      \
66      cusparseStatus_t status = (func);                                  \
67      if (status != CUSPARSE_STATUS_SUCCESS) {                           \
68          printf("CUSPARSE API failed at line %d with error: %s (%d)\n", \
69                 __LINE__, cusparseGetErrorString(status), status);      \
70          return EXIT_FAILURE;                                           \
71      }                                                                  \
(cuda-gdb) n
Single stepping until exit from function _ZN10cusparselt14pruning_kernelILj64ELj8ELNS_43_GLOBAL__N__47f2a60d_10_pruning_cu_35f6c0908SparsityE1ELb1E6__halfEEvPKT3_PS4_llll,
which has no line number information.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (32,0,0), device 0, sm 0, warp 1, lane 0]
0x00007fffb5313600 in void cusparselt::pruning_kernel<64u, 8u, (cusparselt::(anonymous namespace)::Sparsity)1, true, __half>(__half const*, __half*, long, long, long, long)<<<(1,1,1),(64,8,1)>>> ()
(cuda-gdb)
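For reference, the CHECK_CUSPARSE macro shown in the listing is the usual cuSPARSE status-check pattern. A minimal self-contained version of it looks like this (a sketch; the cusparseCreate/cusparseDestroy calls are only illustrative, any cuSPARSE call can be wrapped the same way):

#include <cstdio>
#include <cstdlib>
#include <cusparse.h>

// Same status-check pattern as in the listing above: print the failing
// source line and the cuSPARSE error string, then bail out of the function.
#define CHECK_CUSPARSE(func)                                           \
{                                                                      \
    cusparseStatus_t status = (func);                                  \
    if (status != CUSPARSE_STATUS_SUCCESS) {                           \
        printf("CUSPARSE API failed at line %d with error: %s (%d)\n", \
               __LINE__, cusparseGetErrorString(status), status);      \
        return EXIT_FAILURE;                                           \
    }                                                                  \
}

int main() {
    cusparseHandle_t handle = nullptr;
    CHECK_CUSPARSE( cusparseCreate(&handle) )   // illustrative call
    CHECK_CUSPARSE( cusparseDestroy(handle) )
    printf("cuSPARSE init/teardown OK\n");
    return EXIT_SUCCESS;
}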

So in your case, is it possible for you to get the latest CUDA 12.5 package, then rebuild and debug with that?

FYI, the compile command we use is "nvcc -lcusparseLt -lcusparse -ldl -gencode arch=compute_80,code=sm_80 -g -G -o test a.cu" (we are using Ampere, so we choose 80 here).
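For reference, 0xd1 is 209, which is CUDA_ERROR_NO_BINARY_FOR_GPU in the driver API, so the -gencode value does need to match your GPU when you rebuild. If you are unsure which value to use on your machine, a small query program can print it (a minimal sketch; deviceQuery from the CUDA samples reports the same information):

#include <cstdio>
#include <cuda_runtime.h>

// Print each visible GPU's compute capability so you know which
// compute_XX/sm_XX pair to pass to nvcc's -gencode option.
int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA device visible\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s, compute capability %d.%d (use sm_%d%d)\n",
               i, prop.name, prop.major, prop.minor, prop.major, prop.minor);
    }
    return 0;
}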

We also installed cuSPARSELt from here: cuSPARSELt Downloads | NVIDIA Developer

After reinstalling cuSPARSELt, it still fails to run properly. I am wondering whether the issue might be related to environment variables or to WSL (Windows Subsystem for Linux). Here is my .bashrc configuration:

export TVM_LOG_DEBUG="ir/transform.cc=1,relay/ir/transform.cc=1"
export PATH="/home/whh/environ/bsc/bin:$PATH"
export LIBRARY_PATH="/home/whh/environ/bsc/lib:$LIBRARY_PATH"

export TVM_HOME=/home/whh/tvm
export PYTHONPATH=$TVM_HOME/python:${PYTHONPATH}
export PYTHONPATH=$TVM_HOME/src:${PYTHONPATH}

#export PATH=/usr/local/cuda-12.1/bin${PATH:+:${PATH}}
export PATH=/usr/local/cuda-12.5/bin${PATH:+:${PATH}}
#export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:/usr/local/lib:/home/whh/anaconda3/envs/caffe_test/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.5/lib64:/usr/local/lib:/home/whh/anaconda3/envs/caffe_test/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
#export CUDA_HOME=/usr/local/cuda-12.1:$CUDA_HOME
export CUDA_HOME=/usr/local/cuda-12.5  # CUDA_HOME should be a single directory, not a PATH-style list
# export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
#export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.5/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
# cuda gdb on wsl
#export CUDBG_USE_LEGACY_DEBUGGER=1
# export QT_DEBUG_PLUGINS=1
# not sure whether this needs to be configured
# export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/:${LD_LIBRARY_PATH}
#export CUDA_TOOLKIT_PATH=/usr/local/cuda-12.1:${CUDA_TOOLKIT_PATH}
export CUDA_TOOLKIT_PATH=/usr/local/cuda-12.5:${CUDA_TOOLKIT_PATH}
#export CUDA_VISIBLE_DEVICES="0"
#export CUDA_DEBUGGER_SOFTWARE_PREEMPTION=1
#export CUDA_LAUNCH_BLOCKING=1
#export CUSPARSELT_DIR=/home/whh/workspace/env/cusparseLt
export CUSPARSELT_DIR=/home/whh/workspace/env/libcusparse_lt-linux-x86_64-0.6.1.0-archive
#export LD_LIBRARY_PATH=${CUSPARSELT_DIR}/lib64:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH=${CUSPARSELT_DIR}/lib:${LD_LIBRARY_PATH}

# LAPACK
# $HOME instead of ~ here: tilde does not expand inside double quotes
export LIBRARY_PATH="$LIBRARY_PATH:$HOME/workspace/env/lapack-3.12.0"
export C_INCLUDE_PATH="$C_INCLUDE_PATH:$HOME/workspace/env/lapack-3.12.0/LAPACKE/include:$HOME/workspace/env/lapack-3.12.0/CBLAS/include"
export LAPACK_LIBRARIES="/home/whh/workspace/env/lapack-3.12.0/liblapack.a"
export BLAS_LIBRARIES="/home/whh/workspace/env/lapack-3.12.0/libcblas.a"

#export NVLOG_CONFIG_FILE=${HOME}/workspace/env/nvlog.config
#export NVLOG_CONFIG_FILE=${HOME}/nvlog.local.config

Moreover, even with the NVLOG_CONFIG_FILE line commented out, cuda-gdb keeps printing the error messages while I use it.
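One way to tell a library-loading problem apart from a cuda-gdb problem is a minimal init check. This is only a sketch, assuming the cusparseLtInit/cusparseLtDestroy entry points of the 0.6 API and nothing else:

#include <cstdio>
#include <cusparseLt.h>

// If LD_LIBRARY_PATH does not reach libcusparseLt.so, this binary fails
// at load time ("cannot open shared object file") before main() runs;
// if the library loads but something else in the environment is broken,
// cusparseLtInit reports a nonzero status instead.
int main() {
    cusparseLtHandle_t handle;
    cusparseStatus_t status = cusparseLtInit(&handle);
    if (status != CUSPARSE_STATUS_SUCCESS) {
        printf("cusparseLtInit failed with status %d\n", (int)status);
        return 1;
    }
    printf("cusparseLtInit OK\n");
    cusparseLtDestroy(&handle);
    return 0;
}

Built with something like "nvcc -I${CUSPARSELT_DIR}/include -L${CUSPARSELT_DIR}/lib -lcusparseLt -o ltcheck ltcheck.cu" (ltcheck.cu is a made-up file name).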

Hi, @892516165

I noticed you set "export CUDBG_USE_LEGACY_DEBUGGER=1" in your .bashrc.
Please remove it and try again.
