After hours of debugging I’m stuck. The CUDA FP8 sample runs on my machine and works.
In my real code, however, the heuristic routine (cublasLtMatmulAlgoGetHeuristic) returns 0 results for the same input that works in the sample code.
I tried passing it demo data identical to the sample’s (see below) and I still get zero results, as if I were on an older card.
I have the latest NVIDIA drivers, a 4090, and CUDA/cuBLAS 12.2 installed.
cublasLtHandle_t ltHandle;
cublasLtMatmulDesc_t operationDesc;
cublasLtMatrixLayout_t Adesc = NULL, Bdesc = NULL, Cdesc = NULL, Ddesc = NULL;
m=n=k=ldc=ldb=lda=64;
transa = CUBLAS_OP_T; transb = CUBLAS_OP_N;
CUBLAS_CHECK(cublasLtMatmulDescCreate(&operationDesc, CUBLAS_COMPUTE_32F, CUDA_R_32F));
CUBLAS_CHECK(cublasLtMatmulDescSetAttribute(operationDesc, CUBLASLT_MATMUL_DESC_TRANSA, &transa, sizeof(transa)));
CUBLAS_CHECK(cublasLtMatmulDescSetAttribute(operationDesc, CUBLASLT_MATMUL_DESC_TRANSB, &transb, sizeof(transb)));
CUBLAS_CHECK(cublasLtMatrixLayoutCreate(&Adesc, CUDA_R_8F_E4M3, transa == CUBLAS_OP_N ? m : k, transa == CUBLAS_OP_N ? k : m, lda));
CUBLAS_CHECK(cublasLtMatrixLayoutCreate(&Bdesc, CUDA_R_8F_E4M3, transb == CUBLAS_OP_N ? k : n, transb == CUBLAS_OP_N ? n : k, ldb));
CUBLAS_CHECK(cublasLtMatrixLayoutCreate(&Cdesc, CUDA_R_16BF, m, n, ldc));
CUBLAS_CHECK(cublasLtMatrixLayoutCreate(&Ddesc, CUDA_R_8F_E4M3, m, n, ldc));
cublasLtMatmulHeuristicResult_t heuristicResult = {};
int returnedResults = 0;
cublasLtMatmulPreference_t preference = NULL;
CUBLAS_CHECK(cublasLtMatmulPreferenceCreate(&preference));
CUBLAS_CHECK(cublasLtMatmulAlgoGetHeuristic((cublasLtHandle_t)ltHandle, operationDesc, Adesc, Bdesc, Cdesc, Ddesc, preference, 1, &heuristicResult, &returnedResults));
**>>> returnedResults == 0 <<<**
cublasGemmEx works fine, in case that matters.
The same code is in the sample_cublasLt_LtFp8Matmul.cu sample.
I removed the scales and the workspace allocation from the sample code, so the two are identical.
I compared how nvcc compiles the two projects, and the command lines look similar (no arch specifications, etc.).
Compilation of sample:
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64" -x cu -I"C:\program files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\include" -I"C:\program files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\lib\x64" -IC:\temp\examples\CUDALibrarySamples\cuBLASLt\LtFp8Matmul\..\Common -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\include" --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart static -Xcompiler="/EHsc -Zi -Ob0" -g -D_WINDOWS -D"CMAKE_INTDIR=\"Debug\"" -D_MBCS -DWIN32 -D_WINDOWS -D"CMAKE_INTDIR=\"Debug\"" -Xcompiler "/EHsc /W3 /nologo /Od /FS /Zi /RTC1 /MDd /GR"
Compilation of my cuda.cu:
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64" -x cu -I"C:\program files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\include" -I"C:\program files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\lib\x64" -IC:\temp\. -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\include" --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart static -Xcompiler="/EHsc -Zi -Ob0" -g