Is there any tool to check for hardware problems that lead to CUDA/cuDNN errors?

Hi everyone, recently I’ve been plagued by incessant CUDA errors on my RTX 4090 when running my PyTorch code.

I’d been running my code stably for months before the CUDA errors suddenly popped up. The first time I hit them, I fixed them by reinstalling the driver, but reinstalling does not work this time. I’ve also tried reinstalling my system, using Docker images, and strictly inspecting my software dependencies, but the errors keep showing up after hundreds or thousands of iterations. I checked the validity of my code by running it on another 4090. So basically, software-related issues have been ruled out.

My question is: if a hardware problem in my RTX 4090 is causing these CUDA errors, how can I confirm it? I have tried FurMark and 3DMark, but they didn’t show any sign of a hardware problem.
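One check that graphics benchmarks do not cover is whether the card returns the same answer for the same computation every time. Below is a minimal sketch of that idea, not an established diagnostic tool: it repeats a fixed matrix multiplication and compares hashes of the results. The names `find_inconsistent_runs` and `matmul_bytes` are made up for illustration, and the sketch assumes that a single-stream matmul on fixed inputs is bit-reproducible on a healthy GPU, which is generally expected but not guaranteed for every configuration.

```python
import hashlib
from typing import Callable, List


def find_inconsistent_runs(compute: Callable[[], bytes], repeats: int = 100) -> List[int]:
    """Call `compute` `repeats` times; return indices of runs whose bytes differ from run 0."""
    reference = hashlib.sha256(compute()).digest()
    mismatches = []
    for run in range(1, repeats):
        if hashlib.sha256(compute()).digest() != reference:
            mismatches.append(run)
    return mismatches


def matmul_bytes(size: int = 4096, seed: int = 0) -> bytes:
    """Multiply two fixed random matrices on the GPU (CPU fallback) and return the raw result."""
    import torch  # lazy import: the checker above needs only the standard library

    device = "cuda" if torch.cuda.is_available() else "cpu"
    gen = torch.Generator().manual_seed(seed)  # same seed, so same inputs on every call
    a = torch.randn(size, size, generator=gen).to(device)
    b = torch.randn(size, size, generator=gen).to(device)
    out = a @ b
    return out.cpu().numpy().tobytes()  # copying to the CPU waits for the GPU work to finish
```

Calling `find_inconsistent_runs(matmul_bytes, repeats=1000)` should return an empty list if every run agrees. A non-empty list, or a CUDA exception partway through, would point toward the card rather than the training code, although a clean run does not prove the hardware is healthy.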

Here is a summary of the errors I am seeing:

  1. RuntimeError: CUDA error: an illegal memory access was encountered

Adding CUDA_LAUNCH_BLOCKING=1 gives a different error:

  1. CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 256 n 16794 k 256 mat1_ld 256 mat2_ld 256 result_ld 256 abcType 0 computeType 68 scaleType 0
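To replay the failing GEMM outside the training loop, the shape parameters can be pulled out of the error text. This is a rough sketch under one assumption: that everything after " with " in the message is an alternating list of parameter names and integer values, as in the message above. The helper name `parse_cublaslt_params` is hypothetical.

```python
from typing import Dict

# Error text copied from the report above.
MESSAGE = (
    "CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with "
    "transpose_mat1 1 transpose_mat2 0 m 256 n 16794 k 256 "
    "mat1_ld 256 mat2_ld 256 result_ld 256 abcType 0 computeType 68 scaleType 0"
)


def parse_cublaslt_params(message: str) -> Dict[str, int]:
    """Return the name/value pairs that follow ' with ' in the error message."""
    _, separator, tail = message.partition(" with ")
    if not separator:
        raise ValueError("message does not contain ' with '")
    tokens = tail.split()
    # Assumed layout: name value name value ...
    return {name: int(value) for name, value in zip(tokens[0::2], tokens[1::2])}


params = parse_cublaslt_params(MESSAGE)
print(params["m"], params["n"], params["k"])
```

The extracted `m`, `n` and `k` values can then be used as the matrix sizes in a standalone matmul loop, so the same workload can be run repeatedly on the suspect card and on a known-good one.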

Hi @Inkj, may I know how you are installing the drivers?
The recommended way is -

I’ve encountered similar issues. I have two GPUs at my disposal: an Asus ROG Strix RTX 4090 that operates flawlessly, and a Gigabyte RTX 4090 that shows the same errors as yours. I’m unable to train models with PyTorch on it, despite having reinstalled various versions of the CUDA libraries thousands of times. Benchmarks seem to run fine. I’ve reached out to the seller and even sent the card in for repairs, but the problem remains unresolved. I’ve been grappling with this for half a year now and still haven’t found a solution.