We tried to run GEMM with INT8 (using the cuBLAS cublasGemmEx API), but ran into the following issues:
- In our typical setting (M=768, N=786432, K=128), GEMM with INT8 (volta_sgemm_int8_128x128_nt) is much slower than FP16 (turing_h1688gemm_128x128_ldg8_nt): 21.443 ms vs. 8.6957 ms per call. I changed the CUDA version from 10.1 to 11.2, and the performance results are the same.
- We would like to use UINT8 instead of INT8. How should cublasGemmEx be configured for that? It is not clear from the cuBLAS manual. I tried CUDA_R_8U instead of CUDA_R_8I, but the results seem wrong.
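On the UINT8 point, one possible workaround (an assumption on my side, not something taken from the cuBLAS manual) is to keep the GEMM itself in INT8 and shift the unsigned data by 128 beforehand. Since (a' + 128)(b' + 128) = a'b' + 128a' + 128b' + 128^2, the unsigned result can be recovered from the signed INT8 product plus row/column sums. The host-side sketch below only checks this identity for a single dot product; the function names `dot_u8` and `dot_u8_via_s8` are hypothetical helpers, and nothing here calls cuBLAS.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference: dot product computed directly on unsigned 8-bit inputs.
int32_t dot_u8(const std::vector<uint8_t>& a, const std::vector<uint8_t>& b) {
    assert(a.size() == b.size());
    int32_t acc = 0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        acc += static_cast<int32_t>(a[k]) * static_cast<int32_t>(b[k]);
    }
    return acc;
}

// Same result using only signed 8-bit products (what an INT8 GEMM would
// compute) plus a correction built from the sums of the shifted inputs.
int32_t dot_u8_via_s8(const std::vector<uint8_t>& a, const std::vector<uint8_t>& b) {
    assert(a.size() == b.size());
    const int32_t off = 128;  // maps [0, 255] onto [-128, 127]
    int32_t acc = 0;          // sum of signed products
    int32_t sum_a = 0;        // sum of shifted a values
    int32_t sum_b = 0;        // sum of shifted b values
    for (std::size_t k = 0; k < a.size(); ++k) {
        const int8_t sa = static_cast<int8_t>(static_cast<int32_t>(a[k]) - off);
        const int8_t sb = static_cast<int8_t>(static_cast<int32_t>(b[k]) - off);
        acc += static_cast<int32_t>(sa) * static_cast<int32_t>(sb);
        sum_a += sa;
        sum_b += sb;
    }
    // Undo the shift: a*b = sa*sb + off*sa + off*sb + off*off, summed over k.
    const int32_t n = static_cast<int32_t>(a.size());
    return acc + off * (sum_a + sum_b) + n * off * off;
}
```

For a full GEMM the same correction would apply per output element, using the row sums of A and the column sums of B. Whether the extra passes are worth it depends on how fast the INT8 kernel actually is, which is the first issue above.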
Our benchmark code: GitHub - Junsong-Wang/cuBLASTest
The test was performed on a Tesla T4 card, with Driver Version: 418.181.07, CUDA Version: 10.1
The test results are attached below:
root@c0dca262005a:~/cuBLASTest/build# nvprof ./cublastest
==7890== NVPROF is profiling process 7890, command: ./cublastest
===== start to test HGEMM, M=768, N=786432, K=128, test iterations:16 =====
FP16, total Time (timeofday) in 16 interations is 1.91351s.
===== start to test GEMMEx(INT8), M=768, N=786432, K=128, test iterations:16 =====
INT8, total Time (timeofday) in 16 interations is 3.74584s.
==7890== Profiling application: ./cublastest
==7890== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 90.96% 5.17309s 32 161.66ms 110.38ms 213.00ms [CUDA memcpy DtoH]
6.03% 343.08ms 16 21.443ms 21.181ms 21.899ms volta_sgemm_int8_128x128_nt
2.45% 139.13ms 16 8.6957ms 7.7914ms 12.507ms turing_h1688gemm_128x128_ldg8_nt
0.56% 31.810ms 5 6.3621ms 2.0160us 20.800ms [CUDA memcpy HtoD]
API calls: 59.41% 5.68842s 36 158.01ms 65.628us 234.50ms cudaMemcpy2D
22.32% 2.13684s 8 267.10ms 33.173us 1.18106s cudaHostAlloc
10.39% 995.07ms 9 110.56ms 1.0420us 652.45ms cudaFree
7.79% 746.06ms 6 124.34ms 59.643us 447.81ms cudaFreeHost
0.05% 4.8069ms 6 801.16us 61.777us 2.4387ms cudaMallocPitch
0.02% 1.7026ms 32 53.204us 29.020us 69.090us cudaLaunchKernel
0.01% 852.75us 3 284.25us 277.59us 295.24us cuDeviceTotalMem
0.01% 610.51us 285 2.1420us 158ns 97.318us cuDeviceGetAttribute
0.00% 414.18us 3 138.06us 7.6690us 384.40us cudaMalloc
0.00% 276.94us 80 3.4610us 933ns 13.357us cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
0.00% 123.31us 169 729ns 428ns 7.0680us cudaFuncSetAttribute
0.00% 119.43us 3 39.809us 33.549us 46.820us cuDeviceGetName
0.00% 40.201us 1 40.201us 40.201us 40.201us cudaMemcpy
0.00% 17.637us 16 1.1020us 517ns 7.7770us cudaEventCreateWithFlags
0.00% 12.096us 32 378ns 225ns 577ns cudaGetLastError
0.00% 8.4200us 1 8.4200us 8.4200us 8.4200us cuDeviceGetPCIBusId
0.00% 6.4310us 11 584ns 345ns 1.7760us cudaDeviceGetAttribute
0.00% 5.9760us 2 2.9880us 2.9040us 3.0720us cuInit
0.00% 5.0820us 1 5.0820us 5.0820us 5.0820us cudaGetDevice
0.00% 3.7190us 5 743ns 250ns 2.1850us cuDeviceGetCount
0.00% 2.0360us 4 509ns 189ns 983ns cuDeviceGet
0.00% 1.3220us 2 661ns 526ns 796ns cuDriverGetVersion
0.00% 884ns 3 294ns 290ns 304ns cuDeviceGetUuid
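To put the two kernel averages from the profile into perspective, they can be converted into effective throughput by counting 2*M*N*K operations per GEMM (one multiply and one add per inner-product term). The sketch below is plain host-side arithmetic; `gemm_ops` and `effective_tops` are hypothetical helper names, and the only inputs are the M, N, K and per-call kernel times quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Arithmetic operations in one M x N x K GEMM: a multiply and an add
// for every (m, n, k) triple.
constexpr int64_t gemm_ops(int64_t m, int64_t n, int64_t k) {
    return 2 * m * n * k;
}

// Effective throughput in tera-operations per second, given the
// average kernel time for one GEMM call in milliseconds.
double effective_tops(int64_t m, int64_t n, int64_t k, double kernel_ms) {
    assert(kernel_ms > 0.0);
    const double seconds = kernel_ms * 1e-3;
    return static_cast<double>(gemm_ops(m, n, k)) / seconds / 1e12;
}
```

With M=768, N=786432, K=128, this gives roughly 17.8 TFLOPS for the FP16 kernel (8.6957 ms) and roughly 7.2 TOPS for the INT8 kernel (21.443 ms). The profile also shows that the [CUDA memcpy DtoH] transfers account for about 91% of GPU time, so the wall-clock totals printed by the benchmark mostly reflect copies rather than the GEMM kernels.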