I am using cublasSgemm with tensor cores on an A100; my snippet is below. Even though my data type is float, I am seeing that it calls an f16 kernel.
float *d_A = 0, *d_B = 0, *d_C = 0;
float alpha = 1.0f, beta = 0.0f; // beta value assumed here; it is referenced by the cublasSgemm call below
// allocate n2 floats per matrix on the device
cudaMalloc(reinterpret_cast<void **>(&d_A), n2 * sizeof(d_A[0]));
cudaMalloc(reinterpret_cast<void **>(&d_B), n2 * sizeof(d_B[0]));
cudaMalloc(reinterpret_cast<void **>(&d_C), n2 * sizeof(d_C[0]));
--
// copy the host matrices into the device buffers
cublasSetVector(n2, sizeof(float), h_A, 1, d_A, 1);
cublasSetVector(n2, sizeof(float), h_B, 1, d_B, 1);
cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);
--
// C = alpha * A * B + beta * C, all matrices N x N, column-major
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha, d_A, N, d_B, N, &beta, d_C, N);
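(The handle setup is elided above. Since the handle's math mode is what controls down-conversion, here is a small sketch of how I can query it right before the call; cublasGetMathMode is the standard cuBLAS API, and the enum values in the comment are my reading of the CUDA 11 cublas_api.h:)

cublasMath_t mode; // which math mode is this handle using?
cublasGetMathMode(handle, &mode);
// In the CUDA 11 headers: CUBLAS_DEFAULT_MATH = 0, CUBLAS_TENSOR_OP_MATH = 1
// (deprecated; permits FP16 down-conversion in Sgemm), CUBLAS_PEDANTIC_MATH = 2,
// CUBLAS_TF32_TENSOR_OP_MATH = 3
printf("math mode = %d\n", static_cast<int>(mode));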
The nsys profiler output is at the bottom of this post. The kernel name contains f16, and I am wondering why. As far as I can tell from CUTLASS, that element type is cutlass::half_t, so it is 16-bit. I would expect cuBLAS to use f32 or tf32 here.

Am I doing something wrong? If not, how can cuBLAS use f16 for float inputs?
nsys kernel summary (Device, Ctx, Strm, Name columns):

Device                     Ctx  Strm  Name
NVIDIA A100-SXM4-40GB (0)  1    7     void cutlass::Kernel<cutlass_80_tensorop_s1688f16gemm_256x128_32x3_nn_align4>(T1::Params)
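For reference, here is a minimal, self-contained sketch of what I plan to try in order to rule the math mode out, assuming a CUDA 11+ toolkit. CUBLAS_PEDANTIC_MATH is the documented mode that disallows reduced-precision arithmetic; the matrix size N is just a placeholder, and I have not yet verified that this changes the kernel selection:

#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int N = 1024;      // placeholder size for the sketch
    const int n2 = N * N;
    float *d_A = 0, *d_B = 0, *d_C = 0;
    float alpha = 1.0f, beta = 0.0f;

    cudaMalloc(reinterpret_cast<void **>(&d_A), n2 * sizeof(float));
    cudaMalloc(reinterpret_cast<void **>(&d_B), n2 * sizeof(float));
    cudaMalloc(reinterpret_cast<void **>(&d_C), n2 * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Disallow reduced-precision shortcuts (FP16 down-conversion, TF32):
    cublasSetMathMode(handle, CUBLAS_PEDANTIC_MATH);

    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, d_A, N, d_B, N, &beta, d_C, N);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}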