CUDNN_STATUS_NOT_INITIALIZED

I installed CUDA 11 with driver 460.39 and cuDNN 8.1.1, but when I run my code I get this error:
File “/home/dung/miniconda3/envs/YoloV3/lib/python3.8/site-packages/torch/nn/modules/conv.py”, line 395, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

I switched to another code repository and tested there, but hit the same error, so I believe something is wrong with cuDNN.
I followed all the steps to install cuDNN using the tar method from Installation Guide :: NVIDIA Deep Learning cuDNN Documentation.
After installation, my CUDA is located at /usr/lib/cuda on Ubuntu 20.10.

I cannot test with mnistCUDNN since I installed via the tar method.
I don’t know what is wrong. Please help me
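For reference, here is a minimal script that forces cuDNN initialization outside of any repository, so the failure can be reproduced in isolation (a sketch; it assumes PyTorch is importable and degrades gracefully otherwise):

```python
def cudnn_smoke_test() -> str:
    """Force cuDNN initialization by running one tiny convolution on the GPU."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible to PyTorch"
    x = torch.randn(1, 3, 32, 32, device="cuda")
    conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
    conv(x)  # CUDNN_STATUS_NOT_INITIALIZED would surface on this call
    return "cuDNN conv OK, cuDNN version %s" % torch.backends.cudnn.version()

print(cudnn_smoke_test())
```

If this fails with the same error, the problem is in the CUDA/cuDNN/PyTorch stack rather than in any particular code repository.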

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post. Also, please post the output of the queryDevice cuda sample.

Hi, below is the nvidia-bug-report.log.gz (341.3 KB)

However, I’m not sure what you mean by the queryDevice cuda sample. Can you explain?

Driver looks fine.
When you install CUDA, some sample applications are installed alongside it in the CUDA directory, in a subdirectory named samples (or similar). Find it and run ./queryDevice
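If `locate` has no entry for it, a direct search over the usual toolkit prefixes often finds the sample (the paths below are common defaults, not guaranteed for every install):

```shell
# Search the usual CUDA install prefixes for the deviceQuery sample
# (typical default locations; adjust to wherever your toolkit landed)
find /usr/local/cuda* /usr/lib/cuda -name 'deviceQuery*' 2>/dev/null || true
```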

Hi, I cannot find any such file with the command “locate queryDevice”.
I followed the instructions here: drivers - How do you install CUDA 11 on Ubuntu 20.10 and verify the installation - Ask Ubuntu

Sorry, it’s called deviceQuery, I always confuse that.

I cannot find that file either; nvidia-smi and nvcc -V work normally.

I tried installing CUDA following the instructions for Ubuntu 20.04 (mine is 20.10), and the result is the same as before: the same error, although now CUDA is located at /usr/local/cuda. I still cannot find the deviceQuery file you mentioned. Below is a screenshot of my CUDA folder. Can you check whether any file is missing?

I was able to find deviceQuery under /usr/local/cuda/extras/demo_suite
I am also running into the same error and can’t figure out the solution. Here is my deviceQuery output:
CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 8 CUDA Capable device(s)

Device 0: “GeForce RTX 2080 Ti”
CUDA Driver Version / Runtime Version 11.0 / 10.1
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 11019 MBytes (11554717696 bytes)
(68) Multiprocessors, ( 64) CUDA Cores/MP: 4352 CUDA Cores
GPU Max Clock rate: 1545 MHz (1.54 GHz)
Memory Clock rate: 7000 Mhz
Memory Bus Width: 352-bit
L2 Cache Size: 5767168 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 26 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Devices 1–7 (“GeForce RTX 2080 Ti”): identical to Device 0 in every reported property, except the PCI Bus ID (27, 61, 62, 136, 137, 177, and 178 respectively; Domain ID 0, location ID 0 for all).

Peer access from each GeForce RTX 2080 Ti (GPU0–GPU7) to every other GPU: No (all 56 ordered pairs).

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.0, CUDA Runtime Version = 10.1, NumDevs = 8, Device0 = GeForce RTX 2080 Ti, Device1 = GeForce RTX 2080 Ti, Device2 = GeForce RTX 2080 Ti, Device3 = GeForce RTX 2080 Ti, Device4 = GeForce RTX 2080 Ti, Device5 = GeForce RTX 2080 Ti, Device6 = GeForce RTX 2080 Ti, Device7 = GeForce RTX 2080 Ti
Result = PASS

This rather looks like a torch problem, then. Please check whether dumping the torch config yields additional info:

import torch
print(torch.__config__.show())
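A slightly broader version of the same check, which also reports the CUDA runtime and cuDNN versions the PyTorch install was built against (a sketch, guarded so it degrades gracefully if torch is missing):

```python
def torch_build_info() -> dict:
    """Collect the CUDA/cuDNN versions a PyTorch install was built with."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    return {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,          # CUDA runtime baked into the wheel
        "cudnn": torch.backends.cudnn.version(),        # None on CPU-only builds
        "cuda_device_visible": torch.cuda.is_available(),
    }

print(torch_build_info())
```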

Piggybacking off this post, I had the same issue. PyTorch config:

PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

Do you have the appropriate cuda/cudnn versions installed?

Terminal ( dpkg -l | grep cudnn )

ii  libcudnn7                                  7.6.5.32-1+cuda10.2                                   amd64        cuDNN runtime libraries
ii  libcudnn7-dev                              7.6.5.32-1+cuda10.2                                   amd64        cuDNN development libraries and headers
ii  libcudnn7-doc                              7.6.5.32-1+cuda10.2                                   amd64        cuDNN documents and samples

nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

However, there is a mismatch with nvidia-smi. I don’t think it matters; I remember reading that the version nvidia-smi shows is only advisory.

My nvidia-smi reports Driver Version 450.51.05, CUDA Version 11.0.

The CUDA version nvidia-smi reports is just the maximum CUDA version the driver supports; it only has to be equal to or higher than the CUDA runtime in use.
The only issue I can see that might be relevant: your build uses CUDA 10.2 with a maximum of sm_70, which means the Volta generation. What kind of GPU are you using?
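The version rule stated above can be written down as a tiny check (the version strings below are the ones from this thread):

```python
def driver_covers_runtime(driver_cuda: str, runtime_cuda: str) -> bool:
    """True if the driver's max supported CUDA version >= the runtime's version."""
    as_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return as_tuple(driver_cuda) >= as_tuple(runtime_cuda)

# Driver 450.51.05 reports CUDA 11.0; the PyTorch build uses runtime 10.2
print(driver_covers_runtime("11.0", "10.2"))  # -> True, so the driver is not the problem
```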

It’s an RTX 2080

Since Turing-generation GPUs have sm 7.5, this might be the problem; I don’t know for sure, though. Better to check with the PyTorch people or in the CUDA forums:
https://forums.developer.nvidia.com/c/accelerated-computing/cuda/206
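To see the suspected mismatch concretely, you can parse the `NVCC architecture flags` from the torch config posted earlier: that build ships compiled binaries (code=sm_XX) only up to sm_70 and, as dumped, no PTX fallback (code=compute_XX) that the driver could JIT for a newer GPU, so an sm_75 Turing card would have nothing to run. A quick sanity check of that reading:

```python
import re

# NVCC architecture flags copied from the torch config in this thread
flags = ("-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;"
         "-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70")

sass = {int(m) for m in re.findall(r"code=sm_(\d+)", flags)}       # exact-match binaries
ptx  = {int(m) for m in re.findall(r"code=compute_(\d+)", flags)}  # JIT-able fallbacks

gpu_cc = 75  # Turing (e.g. RTX 2080 / 2080 Ti) is compute capability 7.5
runnable = gpu_cc in sass or any(p <= gpu_cc for p in ptx)
print(sorted(sass), sorted(ptx), runnable)  # -> [37, 50, 60, 70] [] False
```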

Alright, thank you.