CUBLAS_STATUS_NOT_SUPPORTED for BF16 (Cuda11.6, Pytorch)

foreveronehundred · January 4, 2023, 5:20am

Hi,
I got the following error while training my PyTorch model with bfloat16 parameters:

File "/opt/conda/envs/XXX/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 768 n 2304 k 768 mat1_ld 768 mat2_ld 768 result_ld 768 abcType 14 computeType 68 scaleType 0

The types of input, self.weight, self.bias are all bfloat16 and the shapes are (9, 256, 768), (768, 768), (768, ), respectively.
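To make the error line easier to read, here is a small sketch that parses the arguments out of the message and checks them against the shapes above. The enum names are taken from the CUDA headers (cudaDataType and cublasComputeType_t) and only the codes relevant here are listed; the helper name parse_gemm_args is made up for illustration. It only decodes the message and does not explain why cuBLASLt rejected the call.

```python
import re

# Enum values from CUDA's library_types.h (cudaDataType) and
# cublas_api.h (cublasComputeType_t). Only a few codes are listed.
CUDA_DATA_TYPES = {0: "CUDA_R_32F", 1: "CUDA_R_64F", 2: "CUDA_R_16F", 14: "CUDA_R_16BF"}
CUBLAS_COMPUTE_TYPES = {64: "CUBLAS_COMPUTE_16F", 68: "CUBLAS_COMPUTE_32F", 70: "CUBLAS_COMPUTE_64F"}

# Error text copied from the traceback above.
ERROR = (
    "CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with "
    "transpose_mat1 1 transpose_mat2 0 m 768 n 2304 k 768 "
    "mat1_ld 768 mat2_ld 768 result_ld 768 "
    "abcType 14 computeType 68 scaleType 0"
)


def parse_gemm_args(message):
    """Return the 'name integer' pairs that follow ' with ' in the error text."""
    _, _, tail = message.partition(" with ")
    pairs = re.findall(r"\b([A-Za-z_]\w*) (\d+)\b", tail)
    return {name: int(value) for name, value in pairs}


args = parse_gemm_args(ERROR)

# Shapes reported in the post.
input_shape = (9, 256, 768)
weight_shape = (768, 768)

# The leading input dimensions are flattened into one GEMM dimension.
rows = input_shape[0] * input_shape[1]

print("n matches 9 * 256:", args["n"] == rows)
print("m and k match the weight:", (args["m"], args["k"]) == weight_shape)
print("abcType:", CUDA_DATA_TYPES[args["abcType"]])
print("computeType:", CUBLAS_COMPUTE_TYPES[args["computeType"]])
print("scaleType:", CUDA_DATA_TYPES[args["scaleType"]])
```

So the failing call is a bfloat16 GEMM with float32 accumulation and float32 scaling, with dimensions that agree with the reported input and weight shapes.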

My PyTorch version is 1.14.0.dev20221213+cu116 and my Python version is 3.8.15.
I am using 8 A100 GPUs (80 GB).

I ran "torch.cuda.is_bf16_supported()" and got "True".

I have tried other models with BF16 parameters and did not get this error. Unfortunately, I cannot share the model that fails. Please let me know if you have any idea what causes the error. Thanks.
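Since the model itself cannot be shared, a standalone check along these lines might help narrow things down. This is only a sketch: it runs a single bfloat16 F.linear with the shapes reported above on random data, and the function name try_bf16_linear is made up for illustration. Because other BF16 models work, it may well succeed even in the failing environment, in which case the problem depends on something else in the model or setup.

```python
try:
    import torch
    import torch.nn.functional as F
except ImportError:  # lets the script report cleanly where PyTorch is missing
    torch = None


def try_bf16_linear(input_shape=(9, 256, 768), out_features=768):
    """Run one bfloat16 F.linear on the GPU and describe what happened."""
    if torch is None:
        return "skipped: PyTorch is not installed"
    if not torch.cuda.is_available():
        return "skipped: no CUDA device visible"

    in_features = input_shape[-1]
    # Random tensors with the dtypes and shapes from the post.
    x = torch.randn(*input_shape, device="cuda", dtype=torch.bfloat16)
    w = torch.randn(out_features, in_features, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(out_features, device="cuda", dtype=torch.bfloat16)

    try:
        y = F.linear(x, w, b)
        torch.cuda.synchronize()  # surface asynchronous CUDA errors here
    except RuntimeError as err:
        return f"failed: {err}"
    return f"ok: output shape {tuple(y.shape)}"


if torch is not None:
    # Record the exact build, since the behaviour differs between versions.
    print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
print(try_bf16_linear())
```

Running the same script in each conda environment gives a like-for-like comparison between builds without involving the private model.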

foreveronehundred · January 4, 2023, 5:50am

I also tried different PyTorch and CUDA versions.

PyTorch 1.12.1, CUDA 11.6

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())

PyTorch 1.12.1, CUDA 11.3

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch

No error messages

Robert_Crovella · January 4, 2023, 4:39pm

You may get better help on the PyTorch forum. NVIDIA experts such as ptrblck regularly help people there.
