Problems with 4090, CUDA (samples), cuDNN (sample). Are these expected?

Greetings,
I am trying to use TensorRT on our Ubuntu 22.04.1 machine with an RTX 4090, but I am running into several problems. Am I doing something wrong here in terms of compatibility? Can we do anything other than wait for an update?
(1) When I tried to install CUDA 11.8 (cuda_11.8.0_520.61.05_linux.run) via the runfile and use the bundled 520.61.05 driver, my screen went blank.
(2) So I purged everything and installed driver 520.56.06 (NVIDIA-Linux-x86_64-520.56.06.run), then CUDA 11.8 (cuda_11.8.0_520.61.05_linux.run, with the driver deselected), and performed the post-installation steps (adding the binaries to PATH and exporting the library path).
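For reference, the post-installation steps mentioned above amount to something like the following, assuming the default runfile install location of /usr/local/cuda-11.8 (adjust the paths if you installed elsewhere):

```shell
# Add the CUDA 11.8 toolchain to PATH and the runtime libraries to the
# dynamic loader's search path (default runfile install location assumed).
export PATH=/usr/local/cuda-11.8/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

Adding these lines to ~/.bashrc makes them persist across sessions; note they are per-user.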
(3) To get started with TensorRT, I installed PyCUDA and moved on to cuDNN.
(4) Zlib was already present; for cuDNN I used this deb: cudnn-local-repo-ubuntu2204-8.6.0.163_1.0-1_amd64.deb
(5) I tried to run the mnistCUDNN sample, but it sometimes gives this:
Executing: mnistCUDNN
cudnnGetVersion() : 8600 , CUDNN_VERSION from cudnn.h : 8600 (8.6.0)
Host compiler version : GCC 11.3.0

There are 1 CUDA capable devices on your machine :
device 0 : sms 128 Capabilities 8.9, SmClock 2580.0 Mhz, MemSize (Mb) 24252, MemClock 10501.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
Loading binary file data/conv1.bin
Loading binary file data/conv1.bias.bin
Loading binary file data/conv2.bin
Loading binary file data/conv2.bias.bin
Loading binary file data/ip1.bin
Loading binary file data/ip1.bias.bin
Loading binary file data/ip2.bin
Loading binary file data/ip2.bias.bin
Loading image data/one_28x28.pgm
Performing forward propagation …
Testing cudnnGetConvolutionForwardAlgorithm_v7 …
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm …
ERROR: cudnn failure (CUDNN_STATUS_ALLOC_FAILED) in mnistCUDNN.cpp:589
Aborting…

(6) Sometimes this:
Executing: mnistCUDNN
cudnnGetVersion() : 8600 , CUDNN_VERSION from cudnn.h : 8600 (8.6.0)
Host compiler version : GCC 11.3.0

There are 1 CUDA capable devices on your machine :
device 0 : sms 128 Capabilities 8.9, SmClock 2580.0 Mhz, MemSize (Mb) 24252, MemClock 10501.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
ERROR: cudnn failure (CUDNN_STATUS_INTERNAL_ERROR) in mnistCUDNN.cpp:414
Aborting…

(7) Both are run with sudo.
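(Worth noting about (7): sudo starts with a sanitized environment by default, so a PATH or LD_LIBRARY_PATH exported for your user is not visible to the command, and the sample can end up loading a different, or no, cuDNN. A sketch of passing the library path through explicitly; the path assumes a default CUDA 11.8 runfile install:)

```shell
# sudo resets the environment by default, so pass LD_LIBRARY_PATH through
# with `env` (commented out here because it needs the built sample binary):
# sudo env "LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64" ./mnistCUDNN

# The same `env` mechanism, demonstrated with a harmless command:
env "LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64" sh -c 'echo "$LD_LIBRARY_PATH"'
```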

(8)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce …    Off  | 00000000:01:00.0  On |                  Off |
|  0%   44C    P2   114W / 450W |    503MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2025      G   /usr/lib/xorg/Xorg                244MiB |
|    0   N/A  N/A      2282      G   /usr/bin/gnome-shell               81MiB |
|    0   N/A  N/A      3519      G   …8/usr/lib/firefox/firefox        154MiB |
|    0   N/A  N/A     34621      G   …mviewer/tv_bin/TeamViewer         23MiB |
+-----------------------------------------------------------------------------+

(9)
./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “NVIDIA GeForce RTX 4090”
CUDA Driver Version / Runtime Version 11.8 / 11.8
CUDA Capability Major/Minor version number: 8.9
Total amount of global memory: 24252 MBytes (25430589440 bytes)
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
(128) Multiprocessors, (128) CUDA Cores/MP: 16384 CUDA Cores
GPU Max Clock rate: 2580 MHz (2.58 GHz)
Memory Clock rate: 10501 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 75497472 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.8, CUDA Runtime Version = 11.8, NumDevs = 1
Result = PASS

(10) Should I be worried about this warning?
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM

(11)
./deviceQueryDrv Starting…

CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)

Device 0: “NVIDIA GeForce RTX 4090”
CUDA Driver Version: 11.8
CUDA Capability Major/Minor version number: 8.9
Total amount of global memory: 24252 MBytes (25430589440 bytes)
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
(128) Multiprocessors, (128) CUDA Cores/MP: 16384 CUDA Cores
GPU Max Clock rate: 2580 MHz (2.58 GHz)
Memory Clock rate: 10501 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 75497472 bytes
Max Texture Dimension Sizes 1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS

(12)
./bandwidthTest
[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: NVIDIA GeForce RTX 4090
Quick Mode

CUDA error at bandwidthTest.cu:686 code=46(cudaErrorDevicesUnavailable) “cudaEventCreate(&start)”

Hi @volkan.dinc1,
Apologies for the delayed response; we are checking on this and will get back to you.
Thank you for your patience.

Hi @volkan.dinc1,
Can you please turn on API logging and attach a log? This will help us see exactly where in cuDNN the failures are happening.
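(For reference, cuDNN's API logging is controlled through environment variables documented in the cuDNN developer guide; a sketch for a cuDNN 8.x setup:)

```shell
# Enable cuDNN API call logging (cuDNN 8.x environment variables) and send
# it to a file; then rerun the failing sample to capture the call trace.
export CUDNN_LOGINFO_DBG=1            # 1 = log informational messages
export CUDNN_LOGDEST_DBG=cudnn.log    # or stdout / stderr
# ./mnistCUDNN                        # rerun the failing sample with logging on
```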

Thanks

I don’t have problems with CUDA 11.8 and cuDNN v8.7.0 in PyTorch and TensorFlow, but when will cuDNN be released for CUDA 12.0?

I have the same issue with Ubuntu 22.04, CUDA 12.0, and cuDNN 8.8.

The first time I run ./mnistCUDNN:

... (other printouts)
CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory 
...

The second time:

cudnnGetVersion() : 8800 , CUDNN_VERSION from cudnn.h : 8800 (8.8.0)
Host compiler version : GCC 11.3.0

There are 1 CUDA capable devices on your machine :
device 0 : sms 32  Capabilities 8.6, SmClock 1725.0 Mhz, MemSize (Mb) 12045, MemClock 7001.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
ERROR: cudnn failure (CUDNN_STATUS_INTERNAL_ERROR) in mnistCUDNN.cpp:414
Aborting...

When trying to call jax.random.PRNGKey(0) in Python I get:

jaxlib.xla_extension.XlaRuntimeError: INTERNAL: RET_CHECK failure (external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc:627) dnn != nullptr

I am also running into this problem on Ubuntu 22.04 with driver 525.85.05, CUDA 12.0, and cuDNN 8.8, but on an RTX 3060 XC from EVGA with 12GB GDDR6, along with a pair of A2000s with 12GB as well.