Why does cublasSgemm use `f16` for `float`?

I am using cublasSgemm with Tensor Cores on an A100; my snippet is below. Even though my data type is float, I am seeing that it launches an f16 kernel.

float *d_A = 0, *d_B = 0, *d_C = 0, alpha = 1.0f;
cudaMalloc(reinterpret_cast<void **>(&d_A), n2 * sizeof(d_A[0]));
cudaMalloc(reinterpret_cast<void **>(&d_B), n2 * sizeof(d_B[0]));
cudaMalloc(reinterpret_cast<void **>(&d_C), n2 * sizeof(d_C[0]));
--
cublasSetVector(n2, sizeof(float), h_A, 1, d_A, 1);
cublasSetVector(n2, sizeof(float), h_B, 1, d_B, 1);
cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);
--
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha, d_A, N, d_B, N, &beta, d_C, N);

Below is the profiler result from nsys. The kernel name contains f16, and I am wondering what the reason is. As far as I can tell from the CUTLASS source, this corresponds to cutlass::half_t, so it is 16-bit. I would expect cuBLAS to use f32 or tf32 here.

Am I doing something wrong? If not, how can cuBLAS use f16 for float?

          Device            Ctx  Strm                                            Name
 -------------------------  ---  ----  -----------------------------------------------------------------------------------------
 NVIDIA A100-SXM4-40GB (0)    1     7  void cutlass::Kernel<cutlass_80_tensorop_s1688f16gemm_256x128_32x3_nn_align4>(T1::Params)

I don’t see it. Here is my test case, CUDA 12.0:

$ cat t1.cu
#include <cublas_v2.h>

int main(){

  cublasHandle_t handle;
  cublasCreate(&handle);

  float *d_A = 0, *d_B = 0, *d_C = 0, alpha = 1.0f;
  const int N = 1024;
  const int n2 = N*N;
  cudaMalloc(reinterpret_cast<void **>(&d_A), n2 * sizeof(d_A[0]));
  cudaMalloc(reinterpret_cast<void **>(&d_B), n2 * sizeof(d_B[0]));
  cudaMalloc(reinterpret_cast<void **>(&d_C), n2 * sizeof(d_C[0]));
  float beta = 0.0f;
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha, d_A, N, d_B, N, &beta, d_C, N);

  cudaDeviceSynchronize();
}
$ nvcc -o t1 t1.cu -lcublas
$ nsys profile --stats=true ./t1
<snip>

[6/8] Executing 'gpukernsum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)     GridXYZ         BlockXYZ              Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  --------------  --------------  ----------------------
    100.0          149,152          1  149,152.0  149,152.0   149,152   149,152          0.0     8   16    5   128    1    1  ampere_sgemm_128x64_nn
<snip>
$ sudo /usr/local/cuda/bin/ncu ./t1
==PROF== Connected to process 43469 (/home/.../t1)
==PROF== Profiling "ampere_sgemm_128x64_nn" - 0: 0%....50%....100% - 10 passes
==PROF== Disconnected from process 43469
[43469] t1@127.0.0.1
  ampere_sgemm_128x64_nn (8, 16, 5)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
<snip>
$ nvidia-smi
Wed Mar  8 17:55:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   30C    P0    54W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   28C    P0    50W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   28C    P0    49W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                    0 |
| N/A   27C    P0    49W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Perhaps your nsys profiler output does not actually correspond to that particular call, but to something else in your code?

Otherwise, please provide a complete test case, just as I have done, including the CUDA version.
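One quick sanity check you could add (a sketch, not part of my test case above): query the math mode on the handle with cublasGetMathMode, to rule out a CUBLAS_TENSOR_OP_MATH setting somewhere else in your program.

#include <cublas_v2.h>
#include <cstdio>

int main(){
  cublasHandle_t handle;
  cublasCreate(&handle);
  cublasMath_t mode;
  cublasGetMathMode(handle, &mode);
  // CUBLAS_DEFAULT_MATH (0) means SGEMM should not down-convert to FP16
  printf("math mode: %d\n", (int)mode);
  cublasDestroy(handle);
}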

You need to enable Tensor Cores for cuBLAS; otherwise it uses the SIMT version.

You can set it with the line below:
cublasSetMathMode( handle, CUBLAS_TENSOR_OP_MATH );

I used this example, but the code is the same as what ships with the CUDA samples.

The reason for that is documented:

This mode is deprecated and will be removed in a future release. Allows the library to use Tensor Core operations whenever possible. For single precision GEMM routines cuBLAS will use the CUBLAS_COMPUTE_32F_FAST_16F compute type.

(emphasis added)

Usage of that is deprecated, so for future compatibility you should remove it from your code.

And if you find that behavior objectionable, then don’t use that switch.
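If you want tensor core acceleration without the deprecated switch, a minimal sketch (not from the documentation quote above) is to request the compute type explicitly, per call, with cublasGemmEx: CUBLAS_COMPUTE_32F_FAST_16F reproduces the FP16 down-conversion behavior, while CUBLAS_COMPUTE_32F_FAST_TF32 uses TF32 tensor cores and keeps the FP32 dynamic range. The wrapper name sgemm_tc here is made up for illustration.

#include <cublas_v2.h>

// Hypothetical wrapper: FP32 data in and out, square NxN, no transposes;
// the compute type picks the tensor core path explicitly.
cublasStatus_t sgemm_tc(cublasHandle_t handle, int N,
                        const float *alpha, const float *d_A,
                        const float *d_B, const float *beta, float *d_C)
{
  return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                      alpha,
                      d_A, CUDA_R_32F, N,
                      d_B, CUDA_R_32F, N,
                      beta,
                      d_C, CUDA_R_32F, N,
                      CUBLAS_COMPUTE_32F_FAST_TF32, // or CUBLAS_COMPUTE_32F_FAST_16F
                      CUBLAS_GEMM_DEFAULT);
}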

Thanks. I was using cuBLAS from CUDA 11.5 and didn't notice this flag was deprecated.

Going back to my original question: how can cuBLAS use FP16 for a float type?

It uses something called “automatic down-conversion”.

I don’t have a detailed description, but I believe the expectation is that your input data must fit within the FP16 type in order for the calculation to produce the expected results.

Example:

$ cat t2208.cu
#include <cublas_v2.h>
#include <limits>
#include <iostream>

int main(){

  cublasHandle_t handle;
  cublasCreate(&handle);
#ifdef USE_TC
  cublasSetMathMode( handle, CUBLAS_TENSOR_OP_MATH );
#endif
  float *h_X, *d_A = 0, *d_B = 0, *d_C = 0, alpha = 1.0f;
  const int N = 1024;
  const int n2 = N*N;
  h_X = new float[n2]();
  cudaMalloc(reinterpret_cast<void **>(&d_A), n2 * sizeof(d_A[0]));
  cudaMalloc(reinterpret_cast<void **>(&d_B), n2 * sizeof(d_B[0]));
  cudaMalloc(reinterpret_cast<void **>(&d_C), n2 * sizeof(d_C[0]));
  for (int i = 0; i < n2; i+=N+1)
    h_X[i] = 0.1f;                   // A: 0.1 on the diagonal, 0 elsewhere
  cudaMemcpy(d_A, h_X, n2*sizeof(h_X[0]), cudaMemcpyHostToDevice);
  for (int i = 0; i < n2; i++) h_X[i] = std::numeric_limits<float>::max(); // B: FLT_MAX everywhere
  cudaMemcpy(d_B, h_X, n2*sizeof(h_X[0]), cudaMemcpyHostToDevice);
  float beta = 0.0f;
  cublasStatus_t s = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha, d_A, N, d_B, N, &beta, d_C, N);
  cudaMemcpy(h_X, d_C, n2*sizeof(h_X[0]), cudaMemcpyDeviceToHost);
  std::cout << std::numeric_limits<float>::max() << std::endl;
  std::cout << "status: " << (int) s << std::endl;
  std::cout << "result[0]: " << h_X[0] << std::endl;
}
$ nvcc -o t2208 t2208.cu -lcublas
$ ./t2208
3.40282e+38
status: 0
result[0]: 3.40282e+37
$ nvcc -o t2208 t2208.cu -lcublas -DUSE_TC
$ ./t2208
3.40282e+38
status: 0
result[0]: nan
$
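My reading of why the second run produces nan (an inference, not an official description): with the FAST_16F compute type, A and B are down-converted to FP16 before the multiply. FLT_MAX is far outside the FP16 range (largest finite value 65504), so it overflows to infinity, and each off-diagonal zero of A then multiplies that infinity; 0 * inf is NaN, which propagates through the dot-product sum. A small sketch of the two steps:

#include <cuda_fp16.h>
#include <limits>
#include <cstdio>

int main(){
  // Round-tripping FLT_MAX through FP16 overflows to infinity
  float big = std::numeric_limits<float>::max();
  float roundtrip = __half2float(__float2half(big));
  printf("FLT_MAX through FP16: %f\n", roundtrip); // inf
  // A zero element of A times that infinity is NaN
  printf("0 * inf = %f\n", 0.0f * roundtrip);      // nan
}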

If you’d like to see an improvement in the documentation, please file a bug.
