cublasGemmEx() should not return success when the scalar type is not correct

I’m testing cuBLAS GEMM speed on a 4090 GPU with CUDA 12.1 (cublas …). Here is the testing code.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#define m 4096
#define n 4096
#define k 4096
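int main()
{
    // Device buffers for A (m x k), B (k x n) and C (m x n), all in half precision.
    void *bufa, *bufb, *bufc;
    cudaMalloc(&bufa, m * k * sizeof(half));
    cudaMalloc(&bufb, k * n * sizeof(half));
    cudaMalloc(&bufc, m * n * sizeof(half));
    cublasHandle_t handle;
    cublasCreate(&handle);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    // The mismatch under discussion: float scalars combined with CUBLAS_COMPUTE_16F.
    float alpha = 1.0f;
    float beta = 0.0f;
    cudaEventRecord(start);
    for (int i = 0; i < 10000; i++)
    {
        cublasStatus_t ret = cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha, bufa, CUDA_R_16F, m, bufb, CUDA_R_16F, k, &beta, bufc, CUDA_R_16F, m, CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
        assert(ret == CUBLAS_STATUS_SUCCESS);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    // Elapsed time between the two events, in milliseconds.
    float time;
    cudaEventElapsedTime(&time, start, stop);
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess)
        printf("error:%s\n", cudaGetErrorString(error));
    printf("time:%f\n", time / 10000);
    return 0;
}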

The program finishes successfully with no errors reported. However, the time measured between the CUDA events is very small, and the implied FLOPS exceed the hardware limit, so it looks like the kernel is not executed. The bug is that I use CUBLAS_COMPUTE_16F as the compute type while alpha and beta are declared as float. According to the documentation, this combination is not supported.
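The "faster than the hardware" check can be made explicit with a little arithmetic. The sketch below is plain host-side C++ and does not call cuBLAS. The helper names `gemm_flops`, `implied_tflops` and `exceeds_peak` are made up for illustration, and the peak value is left as a parameter because it depends on the GPU and the data type.

```cpp
#include <cstdint>

// Floating-point operations in one m x n x k GEMM:
// one multiply and one add for each of the m * n * k terms.
constexpr double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) * static_cast<double>(k);
}

// Throughput in TFLOP/s implied by an average time per GEMM given in milliseconds.
constexpr double implied_tflops(std::int64_t m, std::int64_t n, std::int64_t k, double avg_ms) {
    return gemm_flops(m, n, k) / (avg_ms * 1e-3) / 1e12;
}

// True when the implied throughput is above the peak supplied by the caller,
// which suggests the timed region did not really perform the GEMMs.
constexpr bool exceeds_peak(double tflops, double peak_tflops) {
    return tflops > peak_tflops;
}
```

For m = n = k = 4096, one GEMM is 2 * 4096^3 = 137,438,953,472 floating-point operations, so an average of 1 ms per call would already correspond to roughly 137 TFLOP/s. An average far below that, as printed by the test program, implies a rate no GPU can reach.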

But I think the cublasGemmEx API should not return CUBLAS_STATUS_SUCCESS when the scalar type is incorrect. This behavior confused me until I found a similar report on Stack Overflow: "c++ - cublasGemmEx result is always zero".
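A likely explanation for the all-zero result, assuming the library reads alpha and beta through their `const void *` pointers as 16-bit halves when the compute type is CUBLAS_COMPUTE_16F: on a little-endian host, the first two bytes of the float 1.0f are both zero, which is the half value +0.0. The sketch below only inspects bit patterns on the host. The helper names `float_bits` and `low_half_bits` are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Raw IEEE-754 bit pattern of a 32-bit float.
std::uint32_t float_bits(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return bits;
}

// The low 16 bits of the float. On a little-endian host these are the two
// bytes stored at the float's address, i.e. what would be seen if the same
// pointer were dereferenced as a 16-bit half.
std::uint16_t low_half_bits(float x) {
    return static_cast<std::uint16_t>(float_bits(x) & 0xFFFFu);
}
```

Here `float_bits(1.0f)` is 0x3F800000, so `low_half_bits(1.0f)` is 0x0000, while a half holding 1.0 would have the pattern 0x3C00. Under that assumption the call effectively runs with alpha = 0 and beta = 0, which would give a zero output. Whether the library then skips the multiplication entirely is a guess, but it would be consistent with the very short timings.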

My suggestion is that cublasGemmEx should return CUBLAS_STATUS_INVALID_VALUE in this case.
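Since alpha and beta are passed as `const void *`, the declared type of the scalars is not visible through the pointer, so in the meantime a check on the caller's side is one way to catch the mismatch. The sketch below is a compile-time check using stand-in types so that it builds without the CUDA toolkit. `Compute`, `Half`, `ScalarFor` and `scalar_matches` are hypothetical names, not part of cuBLAS, and only the real-valued 16F, 32F and 64F cases are covered.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins: Compute mirrors CUBLAS_COMPUTE_16F / 32F / 64F,
// and Half stands in for the CUDA half type.
enum class Compute { F16, F32, F64 };
struct Half { std::uint16_t bits; };

// Scalar type expected for alpha and beta under each compute type.
template <Compute C> struct ScalarFor;
template <> struct ScalarFor<Compute::F16> { using type = Half; };
template <> struct ScalarFor<Compute::F32> { using type = float; };
template <> struct ScalarFor<Compute::F64> { using type = double; };

// True when Scalar is the type expected for compute type C.
template <Compute C, typename Scalar>
constexpr bool scalar_matches() {
    return std::is_same<typename ScalarFor<C>::type, Scalar>::value;
}
```

A typed wrapper around the real call could then start with `static_assert(scalar_matches<Compute::F16, Scalar>(), "alpha/beta must be half");`, so the float-with-16F combination from the test program would fail to compile instead of silently producing zeros.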