An important aspect of using INT8 tensorcore in CUBLAS GEMM-style operations is given in the note in the documentation for the cublasGemmEx() function:
CUBLAS_COMPUTE_32I and CUBLAS_COMPUTE_32I_PEDANTIC compute types are only supported with A, B being 4-byte aligned and lda, ldb being multiples of 4. For a better performance, it is also recommended that IMMA kernels requirements for a regular data ordering are met (listed here).
An “IMMA kernel” is an integer tensorcore kernel. So the implication is that specific conditions must be met to use tensorcore for integer work. If we follow that last “listed here” link to the proper place, we see:
To use IMMA kernels, one of the following sets of requirements, with the first being the preferred one, must be met:
- Using a regular data ordering:
  - All matrix pointers must be 4-byte aligned. For even better performance, this condition should hold with 16 instead of 4.
  - Leading dimensions of matrices A, B, C must be multiples of 4.
  - Only the “TN” format is supported - A must be transposed and B non-transposed.
  - Dimensions m and k must be multiples of 4.
- Using the IMMA-specific data ordering - CUBLASLT_ORDER_COL32 for matrices A, C, D, and CUBLASLT_ORDER_COL4_4R2_8C (on Turing or Ampere architecture) or CUBLASLT_ORDER_COL32_2R_4R4 (on Ampere architecture) for matrix B:
  - Leading dimensions of matrices A, B, C must fulfill conditions specific to the memory ordering (see cublasLtOrder_t).
  - Matmul descriptor must specify CUBLAS_OP_T on matrix B and CUBLAS_OP_N (default) on matrix A and C.
  - If scaleType CUDA_R_32I is used, the only supported values for alpha and beta are 0 or 1.
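For reference, the first recipe's conditions can be checked mechanically before dispatching a GEMM. Here is a minimal sketch of such a check (a hypothetical helper of my own, not part of cuBLAS):

#include <cublas_v2.h>
#include <cstdint>

// Hypothetical helper: returns true if the "regular data ordering" IMMA
// requirements (the first recipe above) appear to be satisfied.
bool imma_regular_ordering_ok(const void *A, const void *B, const void *C,
                              int lda, int ldb, int ldc,
                              int m, int k,
                              cublasOperation_t transa, cublasOperation_t transb)
{
  auto aligned4 = [](const void *p){
    return (reinterpret_cast<uintptr_t>(p) % 4) == 0; };  // 16 is even better
  bool ptrs_ok = aligned4(A) && aligned4(B) && aligned4(C);
  bool lds_ok  = (lda % 4 == 0) && (ldb % 4 == 0) && (ldc % 4 == 0);
  bool tn_ok   = (transa == CUBLAS_OP_T) && (transb == CUBLAS_OP_N); // "TN" format
  bool dims_ok = (m % 4 == 0) && (k % 4 == 0);
  return ptrs_ok && lds_ok && tn_ok && dims_ok;
}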
Those notes are important if you want to witness INT8 calculations on tensorcore. Two recipes are given there; I will follow the first, in particular by choosing appropriate dimensions and specifying A transposed and B non-transposed.
Here is an example using CUDA 12.0 on an Ampere A100:
$ cat t2.cu
#include <cublas_v2.h>
#include <iostream>
#ifdef USE_INT8
// using INT8 inputs with INT32 accumulation and output
using mt = char;
using rt = int;
using st = int;
cudaDataType Atype = CUDA_R_8I;
cudaDataType Ctype = CUDA_R_32I;
cublasComputeType_t computeType = CUBLAS_COMPUTE_32I;
#else
// using FP16
#include <cuda_fp16.h>
using mt = half;
using rt = half;
using st = half;
cudaDataType Atype = CUDA_R_16F;
cudaDataType Ctype = CUDA_R_16F;
cublasComputeType_t computeType = CUBLAS_COMPUTE_16F;
#endif
int main(){
  int dim = 4096; // all of m, n, k are multiples of 4, as the IMMA recipe requires
  int m = dim;
  int n = dim;
  int k = dim;
  mt *A, *B;
  rt *C;
  // cudaMalloc returns allocations aligned to at least 256 bytes,
  // which satisfies the 4-byte (and 16-byte) alignment requirement
  cudaMalloc(&A, sizeof(A[0])*m*k);
  cudaMalloc(&B, sizeof(B[0])*n*k);
  cudaMalloc(&C, sizeof(C[0])*m*n);
  st alpha = 1;
  st beta = 0;
  cublasHandle_t h;
  cublasStatus_t stat = cublasCreate(&h);
  stat = cublasGemmEx(h,
                      CUBLAS_OP_T, // A transposed, B non-transposed: the "TN" format
                      CUBLAS_OP_N,
                      m,
                      n,
                      k,
                      &alpha,
                      A,
                      Atype,
                      dim,         // lda, a multiple of 4
                      B,
                      Atype,       // B has the same type as A
                      dim,         // ldb, a multiple of 4
                      &beta,
                      C,
                      Ctype,
                      dim,         // ldc
                      computeType,
                      CUBLAS_GEMM_DEFAULT);
  std::cout << (int)stat << std::endl;
  cudaDeviceSynchronize();
  cudaError_t err = cudaGetLastError();
  std::cout << cudaGetErrorString(err) << std::endl;
}
$ nvcc -o t2 t2.cu -lcublas
$ nsys nvprof --print-gpu-trace ./t2
WARNING: t2 and any of its children processes will be profiled.
0
no error
Generating '/tmp/nsys-report-8be3.qdstrm'
[1/3] [========================100%] report6.nsys-rep
[2/3] [========================100%] report6.sqlite
[3/3] Executing 'gputrace' stats report
Start (ns) Duration (ns) CorrId GrdX GrdY GrdZ BlkX BlkY BlkZ Reg/Trd StcSMem (MB) DymSMem (MB) Bytes (MB) Throughput (MBps) SrcMemKd DstMemKd Device Ctx Strm Name
------------- ------------- ------ ---- ---- ---- ---- ---- ---- ------- ------------ ------------ ---------- ----------------- -------- -------- ------------------------- --- ---- ---------------------------------------------
1,611,077,622 557,631 3,491 16 32 1 256 1 1 174 0.049 0.098 NVIDIA A100-SXM4-40GB (0) 1 7 ampere_h16816gemm_256x128_ldg8_stages_64x3_tn
Generated:
/home/.../report6.nsys-rep
/home/.../report6.sqlite
$ nvcc -o t2 t2.cu -lcublas -DUSE_INT8
$ nsys nvprof --print-gpu-trace ./t2
WARNING: t2 and any of its children processes will be profiled.
0
no error
Generating '/tmp/nsys-report-be40.qdstrm'
[1/3] [========================100%] report7.nsys-rep
[2/3] [========================100%] report7.sqlite
[3/3] Executing 'gputrace' stats report
Start (ns) Duration (ns) CorrId GrdX GrdY GrdZ BlkX BlkY BlkZ Reg/Trd StcSMem (MB) DymSMem (MB) Bytes (MB) Throughput (MBps) SrcMemKd DstMemKd Device Ctx Strm Name
------------- ------------- ------ ---- ---- ---- ---- ---- ---- ------- ------------ ------------ ---------- ----------------- -------- -------- ------------------------- --- ---- -------------------------------------------------------------------------------------------
1,565,091,375 420,671 3,180 512 4 1 128 1 1 156 0.000 0.074 NVIDIA A100-SXM4-40GB (0) 1 7 void cutlass::Kernel<cutlass_80_tensorop_i16832gemm_s8_128x64_128x3_tn_align16>(T1::Params)
Generated:
/home/.../report7.nsys-rep
/home/.../report7.sqlite
$
We see that in the first compilation case (FP16), the kernel invoked is ampere_h16816gemm_256x128_ldg8_stages_64x3_tn, which is an FP16 TC kernel, and the kernel duration is ~558 microseconds.
In the second compilation case (INT8), the kernel invoked is cutlass::Kernel<cutlass_80_tensorop_i16832gemm_s8_128x64_128x3_tn_align16>(T1::Params), which is an INT8 TC kernel, and the kernel duration is ~421 microseconds, somewhat faster than the FP16 kernel.
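As a rough back-of-the-envelope check, assuming the usual 2*m*n*k operation count for a GEMM, those two durations work out to approximately the following throughputs (the A100 peak rates in the comments are the usual published figures):

#include <iostream>
int main(){
  // 2*m*n*k operations for the 4096x4096x4096 GEMM above,
  // divided by the kernel durations from the two traces
  double ops = 2.0 * 4096 * 4096 * 4096;                     // ~1.37e11 operations
  std::cout << ops / 557631e-9 / 1e12 << " TFLOP/s FP16\n";  // ~246, vs ~312 peak
  std::cout << ops / 420671e-9 / 1e12 << " TOP/s INT8\n";    // ~327, vs ~624 peak
}

Neither kernel reaches the A100's peak rate for its datatype, but both are clearly running at tensorcore-class throughput.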
From what I can see in the documentation, there is no recipe that allows CUBLAS_OP_N on both A and B if you want to witness INT8 TC usage; one workaround is sketched below.
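If both of your operands naturally sit in non-transposed order, one possible workaround (my own sketch, not something the documentation prescribes) is to physically transpose A once into a scratch buffer, then issue the TN call on that copy. A naive INT8 transpose kernel might look like this:

// Naive transpose sketch: in is m x k column-major (ld = m);
// out receives the k x m transpose (ld = k). Both hold INT8 data.
__global__ void transpose_i8(const char *in, char *out, int m, int k){
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // row index into in
  int j = blockIdx.y * blockDim.y + threadIdx.y;  // column index into in
  if ((i < m) && (j < k))
    out[j + (size_t)i * k] = in[i + (size_t)j * m];
}

After that, cublasGemmEx can be called with CUBLAS_OP_T on the transposed copy (with lda = k), which recovers the original non-transposed math while still satisfying the TN requirement.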