CUBLAS_STATUS_INTERNAL_ERROR when running on a specific GPU

nvidia-bug-report.log.gz (3.0 MB)

Hi, I have a server with 8 RTX 3090 GPUs, and I'm encountering a CUDA error exclusively when my code is executed on a particular GPU. Specifically, the issue arises only when I set CUDA_VISIBLE_DEVICES=0; there are no such problems when I use CUDA_VISIBLE_DEVICES=1, CUDA_VISIBLE_DEVICES=2, etc.

Based on this, I suspect there might be a hardware issue with my first GPU (GPU 0). However, I’m finding it challenging to confirm this suspicion. Could you suggest any methods or steps to determine whether this is indeed a hardware-related issue?

Below is the error log:

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
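
In case it helps with reproducing this, a minimal sketch along the following lines (tensor sizes are arbitrary, just enough to hit the same half-precision batched GEMM path in PyTorch) runs the check on each GPU in turn:

```python
import torch

# Minimal sketch: run the kind of half-precision batched matmul that ends up
# in cublasGemmStridedBatchedEx on every visible GPU, one device at a time.
def check_gpu(index: int) -> bool:
    try:
        device = torch.device(f"cuda:{index}")
        a = torch.randn(64, 256, 512, dtype=torch.float16, device=device)
        b = torch.randn(64, 512, 256, dtype=torch.float16, device=device)
        c = torch.bmm(a, b)             # batched GEMM, dispatched to cuBLAS
        torch.cuda.synchronize(device)  # surface any asynchronous CUDA error
        return bool(torch.isfinite(c).all())
    except RuntimeError as err:
        print(f"GPU {index} failed: {err}")
        return False

for i in range(torch.cuda.device_count()):
    status = "OK" if check_gpu(i) else "FAILED"
    print(f"GPU {i} ({torch.cuda.get_device_name(i)}): {status}")
```

If only GPU 0 reports FAILED here, that would support the suspicion that the problem is tied to that particular card rather than to the code.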

The affected GPU is throwing an Xid 31, which might point to defective video memory. Please check it with a GPU memory test (e.g., cuda_memtest) and have the card replaced if it turns out to be faulty.
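
If a dedicated memory tester cannot be run right away, a coarse pattern write/read-back on the suspect card can serve as a first sanity check. This is only a rough sketch (chunk size and count below are placeholders) and is not a substitute for a proper memory test:

```python
import torch

def rough_vram_check(index: int = 0, chunk_mb: int = 256, chunks: int = 16) -> bool:
    """Write a known pattern into VRAM and read it back to the host.

    Only a coarse first check; a dedicated GPU memory tester exercises far
    more access patterns and should be trusted for the final verdict.
    """
    device = torch.device(f"cuda:{index}")
    n = chunk_mb * 1024 * 1024 // 4  # int32 elements per chunk
    # Allocate several distinct chunks so more of the card's memory is touched.
    buffers = [torch.full((n,), 0x5A5A5A5A, dtype=torch.int32, device=device)
               for _ in range(chunks)]
    ok = True
    for i, buf in enumerate(buffers):
        if not bool((buf.cpu() == 0x5A5A5A5A).all()):  # read back and compare
            print(f"Pattern mismatch in chunk {i} on GPU {index}")
            ok = False
    if ok:
        print(f"No mismatches on GPU {index} ({chunks * chunk_mb} MB checked)")
    return ok

rough_vram_check(0)
```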