Inaccurate results for int8 in cublasGemmEx

philzip1 · March 23, 2024, 1:15pm

When using cublasGemmEx with CUDA_R_8I for A and B, and CUDA_R_32I for C, I’m getting results that slightly differ from the actual results. This all sounds pretty exactly like the float bug that was supposedly fixed in 12.3.1: 1. CUDA 12.4 Release Notes — Release Notes 12.4 documentation

“cublasLtMatmul with A and B types equal CUDA_R_8I, scale and C type equal CUDA_R_32I and compute type equals CUBLAS_COMPUTE_32I{,_PEDANTIC} could erroneously perform casting of the result to float data type and then back to integer which result in accuracy loss (for output greater than 2^24). The issue exists since cuBLAS 11.0.”

However, also when updating to the latest CUDA Version 12.4, I still get the inaccurate results. Interestingly, the bug only seems to occur on Hopper (H100 in my test), neither on A100 nor RTX4090. Here is a little demo. Would be interesting to see if this bug is also reproducible for others or my CUDA version update went wrong.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <iostream>
#include <vector>
#include <cstdlib>
#include <ctime>

// Basic CPU impl
void simpleCpuGemm(int m, int n, int k, const int8_t *A, const int8_t *B, int32_t *C)
{
    for (int row = 0; row < m; ++row)
    {
        for (int col = 0; col < n; ++col)
        {
            int32_t sum = 0;
            for (int i = 0; i < k; ++i)
            {
                sum += A[i + row * k] * B[i + col * k];
            }
            C[row + col * m] = sum;
        }
    }
}

int main()
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    int m = 100, n = 32, k = 12800;
    std::srand(std::time(0));

    std::vector<int8_t> h_A(m * k);
    std::vector<int8_t> h_B(k * n);
    std::vector<int32_t> h_C(m * n);
    std::vector<int32_t> h_C_cpu(m * n);

    for (int i = 0; i < m * k; ++i)
    {
        h_A[i] = static_cast<int8_t>(std::rand() % 127);
    }
    for (int i = 0; i < k * n; ++i)
    {
        h_B[i] = static_cast<int8_t>(std::rand() % 127);
    }

    int8_t *A, *B;
    int32_t *C;
    cudaMalloc(&A, m * k * sizeof(int8_t));
    cudaMalloc(&B, k * n * sizeof(int8_t));
    cudaMalloc(&C, m * n * sizeof(int32_t));

    cudaMemcpy(A, h_A.data(), m * k * sizeof(int8_t), cudaMemcpyHostToDevice);
    cudaMemcpy(B, h_B.data(), k * n * sizeof(int8_t), cudaMemcpyHostToDevice);

    const int32_t alpha = 1;
    const int32_t beta = 0;

    // Matmul using cublasGemmEx
    cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_8I, k,
                 B, CUDA_R_8I, k,
                 &beta,
                 C, CUDA_R_32I, m,
                 CUBLAS_COMPUTE_32I_PEDANTIC, CUBLAS_GEMM_DEFAULT);

    cudaMemcpy(h_C.data(), C, m * n * sizeof(int32_t), cudaMemcpyDeviceToHost);

    // CPU sanity check
    simpleCpuGemm(m, n, k, h_A.data(), h_B.data(), h_C_cpu.data());

    // Compare the results
    int diffs = 0;
    for (int i = 0; i < m * n; ++i)
    {
        if (h_C_cpu[i] != h_C[i])
        {
            std::cout << h_C_cpu[i] << " " << h_C[i] << " " << i << "\n";
            diffs++;
        }
    }
    std::cout << "The results " << (diffs == 0 ? "MATCH" : "DO NOT MATCH") << ": " << diffs << " out of " << n * m << " values differ\n";

    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    cublasDestroy(handle);

    return 0;
}

kwierschem · April 3, 2024, 5:37pm

Hi @philzip1 , thanks for the report! I was able to reproduce the issue with 12.4, so it sounds like the release notes were incorrect. The issue is actually fixed in an upcoming release. In the meantime, the release notes have been corrected.

philzip1 · April 5, 2024, 5:58pm

thanks for the reply @kwierschem. But it was not included in the recently published minor release?

kwierschem · April 5, 2024, 11:08pm

Yes, that’s correct. The fix is not included in either 12.4 or 12.4 Update 1, but should be part of a later release.

system · April 19, 2024, 11:09pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
cublasGemmEx doesn't work with INT8 utilizing __dp4a instruction on NVIDIA 1080TI CUDA Programming and Performance	12	3639	September 25, 2017
How does cublasGemmEx() call work with CUDA_R_16F inputs and CUDA_R_32F computeType CUDA Programming and Performance	3	1842	December 10, 2017
cublasSgemm produces non-trivially different results in CUDA 9.1 vs CUDA 8.0 GPU-Accelerated Libraries	9	1127	February 19, 2019
cublasGemmEx failed: 15, ieee_inexact in Fortran and C++ GPU-Accelerated Libraries cuda	4	38	February 28, 2025
cublasGemmEx cant use CUDA_R_8I compute type on GTX1080 GPU-Accelerated Libraries	4	1366	February 12, 2018
cublasGemmEX() INT-8 runtime error GPU-Accelerated Libraries cuda	7	1981	October 12, 2021
cufftExecR2C and cufftExecC2R API calls generates different results in different CUDA tool kit versions GPU-Accelerated Libraries cufft	1	1547	August 9, 2021
Incorrect inversion for matrix with CUBLAS CUDA Programming and Performance	4	1293	December 22, 2016
cublasZgemm() gives false result for large data and potential bug GPU-Accelerated Libraries	6	1149	October 12, 2021
cublasStrsm_v2 returns unchanged input matrix GPU-Accelerated Libraries	2	1006	August 1, 2013

Inaccurate results for int8 in cublasGemmEx

Related topics