Why does cublasSgemm use `f16` for `float`?

I am using cublasSgemm with Tensor Cores on an A100; my snippet is below. Even though my data type is float, I am seeing that it launches an f16 kernel.

float *d_A = 0, *d_B = 0, *d_C = 0, alpha = 1.0f;
cudaMalloc(reinterpret_cast<void **>(&d_A), n2 * sizeof(d_A[0]));
cudaMalloc(reinterpret_cast<void **>(&d_B), n2 * sizeof(d_B[0]));
cudaMalloc(reinterpret_cast<void **>(&d_C), n2 * sizeof(d_C[0]));
--
cublasSetVector(n2, sizeof(float), h_A, 1, d_A, 1);
cublasSetVector(n2, sizeof(float), h_B, 1, d_B, 1);
cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);
--
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha, d_A, N, d_B, N, &beta, d_C, N);

Below is the profiler result from nsys. The kernel name contains f16, and I am wondering what the reason is. As far as I can tell from the CUTLASS source, this corresponds to cutlass::half_t, so it is 16-bit. I would expect cuBLAS to use f32 or tf32 here.

Am I doing something wrong? If not, how can cuBLAS use f16 for float?

          Device            Ctx  Strm                                            Name
 -------------------------  ---  ----  -----------------------------------------------------------------------------------------
 NVIDIA A100-SXM4-40GB (0)    1     7  void cutlass::Kernel<cutlass_80_tensorop_s1688f16gemm_256x128_32x3_nn_align4>(T1::Params)

I don’t see it. Here is my test case, CUDA 12.0:

$ cat t1.cu
#include <cublas_v2.h>

int main(){

  cublasHandle_t handle;
  cublasCreate(&handle);

  float *d_A = 0, *d_B = 0, *d_C = 0, alpha = 1.0f;
  const int N = 1024;
  const int n2 = N*N;
  cudaMalloc(reinterpret_cast<void **>(&d_A), n2 * sizeof(d_A[0]));
  cudaMalloc(reinterpret_cast<void **>(&d_B), n2 * sizeof(d_B[0]));
  cudaMalloc(reinterpret_cast<void **>(&d_C), n2 * sizeof(d_C[0]));
  float beta = 0.0f;
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha, d_A, N, d_B, N, &beta, d_C, N);

  cudaDeviceSynchronize();
}
$ nvcc -o t1 t1.cu -lcublas
$ nsys profile --stats=true ./t1
<snip>

[6/8] Executing 'gpukernsum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)     GridXYZ         BlockXYZ              Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  --------------  --------------  ----------------------
    100.0          149,152          1  149,152.0  149,152.0   149,152   149,152          0.0     8   16    5   128    1    1  ampere_sgemm_128x64_nn
<snip>
$ sudo /usr/local/cuda/bin/ncu ./t1
==PROF== Connected to process 43469 (/home/.../t1)
==PROF== Profiling "ampere_sgemm_128x64_nn" - 0: 0%....50%....100% - 10 passes
==PROF== Disconnected from process 43469
[43469] t1@127.0.0.1
  ampere_sgemm_128x64_nn (8, 16, 5)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
<snip>
$ nvidia-smi
Wed Mar  8 17:55:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   30C    P0    54W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   28C    P0    50W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   28C    P0    49W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                    0 |
| N/A   27C    P0    49W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Perhaps your nsys profiler output does not actually correspond to that particular call, but to something else in your code?

Otherwise, please provide a complete test case, just as I have done, including the CUDA version.
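One quick sanity check you could add (a sketch, not part of my test case above): query the math mode on the handle with cublasGetMathMode, to rule out a CUBLAS_TENSOR_OP_MATH setting somewhere else in your program.

#include <cublas_v2.h>
#include <cstdio>

int main(){
  cublasHandle_t handle;
  cublasCreate(&handle);
  cublasMath_t mode;
  cublasGetMathMode(handle, &mode);
  // CUBLAS_DEFAULT_MATH (0) means SGEMM should not down-convert to FP16
  printf("math mode: %d\n", (int)mode);
  cublasDestroy(handle);
}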

You need to enable Tensor Cores for cuBLAS; otherwise it uses the SIMT version.

You can set it with the line below:
cublasSetMathMode( handle, CUBLAS_TENSOR_OP_MATH );

I used this example, but the code is the same as what ships with the CUDA samples.

The reason for that is documented:

This mode is deprecated and will be removed in a future release. Allows the library to use Tensor Core operations whenever possible. For single precision GEMM routines cuBLAS will use the CUBLAS_COMPUTE_32F_FAST_16F compute type.

(emphasis added)

Usage of that is deprecated, so for future compatibility you should remove it from your code.

And if you find that behavior objectionable, then don’t use that switch.
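If you want tensor core acceleration without the deprecated switch, a minimal sketch (not from the documentation quote above) is to request the compute type explicitly, per call, with cublasGemmEx: CUBLAS_COMPUTE_32F_FAST_16F reproduces the FP16 down-conversion behavior, while CUBLAS_COMPUTE_32F_FAST_TF32 uses TF32 tensor cores and keeps the FP32 dynamic range. The wrapper name sgemm_tc here is made up for illustration.

#include <cublas_v2.h>

// Hypothetical wrapper: FP32 data in and out, square NxN, no transposes;
// the compute type picks the tensor core path explicitly.
cublasStatus_t sgemm_tc(cublasHandle_t handle, int N,
                        const float *alpha, const float *d_A,
                        const float *d_B, const float *beta, float *d_C)
{
  return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                      alpha,
                      d_A, CUDA_R_32F, N,
                      d_B, CUDA_R_32F, N,
                      beta,
                      d_C, CUDA_R_32F, N,
                      CUBLAS_COMPUTE_32F_FAST_TF32, // or CUBLAS_COMPUTE_32F_FAST_16F
                      CUBLAS_GEMM_DEFAULT);
}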

Thanks. I was using cuBLAS from CUDA 11.5 and didn't notice this flag was deprecated.

Going back to my original question: how can cuBLAS use FP16 for a float type?

It uses something called “automatic down-conversion”.

I don’t have a detailed description, but I believe the expectation is that your input data must fit within the FP16 type in order for the calculation to produce the expected results.

Example:

$ cat t2208.cu
#include <cublas_v2.h>
#include <limits>
#include <iostream>

int main(){

  cublasHandle_t handle;
  cublasCreate(&handle);
#ifdef USE_TC
  cublasSetMathMode( handle, CUBLAS_TENSOR_OP_MATH );
#endif
  float *h_X, *d_A = 0, *d_B = 0, *d_C = 0, alpha = 1.0f;
  const int N = 1024;
  const int n2 = N*N;
  h_X = new float[n2]();
  cudaMalloc(reinterpret_cast<void **>(&d_A), n2 * sizeof(d_A[0]));
  cudaMalloc(reinterpret_cast<void **>(&d_B), n2 * sizeof(d_B[0]));
  cudaMalloc(reinterpret_cast<void **>(&d_C), n2 * sizeof(d_C[0]));
  for (int i = 0; i < n2; i+=N+1)
    h_X[i] = 0.1f;                   // A: 0.1 on the diagonal, 0 elsewhere
  cudaMemcpy(d_A, h_X, n2*sizeof(h_X[0]), cudaMemcpyHostToDevice);
  for (int i = 0; i < n2; i++) h_X[i] = std::numeric_limits<float>::max(); // B: FLT_MAX everywhere
  cudaMemcpy(d_B, h_X, n2*sizeof(h_X[0]), cudaMemcpyHostToDevice);
  float beta = 0.0f;
  cublasStatus_t s = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha, d_A, N, d_B, N, &beta, d_C, N);
  cudaMemcpy(h_X, d_C, n2*sizeof(h_X[0]), cudaMemcpyDeviceToHost);
  std::cout << std::numeric_limits<float>::max() << std::endl;
  std::cout << "status: " << (int) s << std::endl;
  std::cout << "result[0]: " << h_X[0] << std::endl;
}
$ nvcc -o t2208 t2208.cu -lcublas
$ ./t2208
3.40282e+38
status: 0
result[0]: 3.40282e+37
$ nvcc -o t2208 t2208.cu -lcublas -DUSE_TC
$ ./t2208
3.40282e+38
status: 0
result[0]: nan
$
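My reading of why the second run produces nan (an inference, not an official description): with the FAST_16F compute type, A and B are down-converted to FP16 before the multiply. FLT_MAX is far outside the FP16 range (largest finite value 65504), so it overflows to infinity, and each off-diagonal zero of A then multiplies that infinity; 0 * inf is NaN, which propagates through the dot-product sum. A small sketch of the two steps:

#include <cuda_fp16.h>
#include <limits>
#include <cstdio>

int main(){
  // Round-tripping FLT_MAX through FP16 overflows to infinity
  float big = std::numeric_limits<float>::max();
  float roundtrip = __half2float(__float2half(big));
  printf("FLT_MAX through FP16: %f\n", roundtrip); // inf
  // A zero element of A times that infinity is NaN
  printf("0 * inf = %f\n", 0.0f * roundtrip);      // nan
}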

If you’d like to see an improvement in the documentation, please file a bug.
