Performance when using 8-bit operations

kalyan.c · July 22, 2019, 5:09am

Hi all,

How do I get the desired latency when running on 8-bit data? I have tried the CUDA GEMM functions, and there is no difference in latency between single-precision gemm and gemmEx with the 8-bit flags. I also tried the dot product intrinsic “dp4a”, with no difference.

Thanks,
kalyan ch
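
For reference, the dp4a operation mentioned above treats each 32-bit operand as four packed signed 8-bit values, multiplies them pairwise, and adds the four products to a 32-bit accumulator. The following is a minimal CPU-side sketch of that arithmetic, useful for checking results on the host. It is not GPU code, and the names dp4a_reference and pack4 are hypothetical helpers, not part of any CUDA API.

```cpp
#include <cstdint>

// Hypothetical helper: pack four signed bytes into one 32-bit word,
// with x0 in the lowest byte.
std::uint32_t pack4(std::int8_t x0, std::int8_t x1, std::int8_t x2, std::int8_t x3) {
    return  static_cast<std::uint32_t>(static_cast<std::uint8_t>(x0))
         | (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x1)) << 8)
         | (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x2)) << 16)
         | (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x3)) << 24);
}

// CPU emulation of a signed 4-way 8-bit dot product with 32-bit accumulate:
// result = c + sum over i of a[i] * b[i], where a[i] and b[i] are the
// signed bytes packed into a and b.
std::int32_t dp4a_reference(std::uint32_t a, std::uint32_t b, std::int32_t c) {
    for (int i = 0; i < 4; ++i) {
        const auto ai = static_cast<std::int8_t>((a >> (8 * i)) & 0xFFu);
        const auto bi = static_cast<std::int8_t>((b >> (8 * i)) & 0xFFu);
        c += static_cast<std::int32_t>(ai) * static_cast<std::int32_t>(bi);
    }
    return c;
}
```

Each call covers four multiplies and four adds, which is why the potential gain from 8-bit data shows up in operations per unit time rather than in the latency of a single operation.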

tera · July 22, 2019, 8:39pm

GPUs are generally not optimized for latency, and 8-bit data accesses will suffer the same latency as 32-bit accesses. Did you mean “throughput”?

kalyan.c · July 24, 2019, 11:10am

No, not throughput. I mean the number of operations per unit time.

Robert_Crovella · July 24, 2019, 2:03pm

That sure looks like throughput to me. I think that might actually be the definition of throughput.

I can certainly demonstrate an increase in throughput for 8 bit int gemm compared to single precision gemm, in the right setting/setup. So not sure what it is you are asking.

Perhaps you should provide the code you used to make the measurement in both cases.
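
As a worked example of "operations per unit time": a GEMM with dimensions m, n, k performs roughly 2·m·n·k operations (one multiply and one add per inner-product term), so throughput is that count divided by the measured elapsed time. A minimal sketch follows; the function names are hypothetical, and any elapsed time you plug in has to come from your own measurement.

```cpp
#include <cstdint>

// Approximate operation count of an m x k by k x n matrix product:
// one multiply and one add for each of the m*n*k inner-product terms.
std::uint64_t gemm_op_count(std::uint64_t m, std::uint64_t n, std::uint64_t k) {
    return 2ULL * m * n * k;
}

// Throughput in operations per second, given a measured elapsed time in seconds.
double gemm_throughput(std::uint64_t m, std::uint64_t n, std::uint64_t k,
                       double elapsed_seconds) {
    return static_cast<double>(gemm_op_count(m, n, k)) / elapsed_seconds;
}
```

For m = n = k = 4096 the count is 137,438,953,472 operations. If such a call took a hypothetical 0.05 s, that would be about 2.75e12 operations per second. Comparing this figure between the FP32 and INT8 runs is the throughput comparison.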

kalyan.c · July 24, 2019, 3:08pm

Hi,

I am comparing these two lines of code:

cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 4096, 4096, 4096, alpha, A1, 4096, B1, 4096, beta, D, 4096);

cublasSgemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, 4096, 4096, 4096, alpha, A, CUDA_R_8I, 4096, B, CUDA_R_8I, 4096, beta, C, CUDA_R_32F, 4096);

and they take almost the same time.

kalyan.c · July 24, 2019, 3:17pm

OK, sorry. Then yes, I am looking for throughput.

Robert_Crovella · July 24, 2019, 3:22pm

I notice you are specifying SgemmEx with CUDA_R_32F for the output type. If I wanted to see a significant speed up of INT8 vs. FP32 I would use GemmEx with:

computeType    Atype/Btype    Ctype
CUDA_R_32I     CUDA_R_8I      CUDA_R_32I

Beyond that, I wouldn’t be able to help you unless you provide complete code. It’s also important to know the platform and GPU you are running on.

If you want to provide all that information, I’ll take a look as time permits.
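
To make the type combination above concrete, here is a minimal CPU reference for the arithmetic it implies: 8-bit integer inputs, with accumulation and output in 32-bit integers. It is only a sketch for validating GPU results on small matrices, not a cuBLAS call. The function name is hypothetical, and the column-major layout and integer alpha/beta are assumptions chosen to mirror a CUBLAS_OP_N call with a 32-bit integer compute type.

```cpp
#include <cstdint>
#include <vector>

// Reference C = alpha * A * B + beta * C with int8 inputs and int32
// accumulation and output. All matrices are column-major with leading
// dimensions lda, ldb, ldc (assumed layout, no transposes).
void gemm_s8_s32_reference(int m, int n, int k,
                           std::int32_t alpha,
                           const std::int8_t* A, int lda,
                           const std::int8_t* B, int ldb,
                           std::int32_t beta,
                           std::int32_t* C, int ldc) {
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            std::int32_t acc = 0;  // 32-bit accumulator for the inner product
            for (int p = 0; p < k; ++p) {
                acc += static_cast<std::int32_t>(A[row + p * lda]) *
                       static_cast<std::int32_t>(B[p + col * ldb]);
            }
            std::int32_t& c = C[row + col * ldc];
            c = alpha * acc + beta * c;
        }
    }
}
```

Running this on a small problem and comparing against the GPU output is a quick way to confirm that the INT8 path is actually being exercised with the intended types before timing it.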

kalyan.c · July 25, 2019, 7:03am

Thanks, Robert_Crovella. With GemmEx I do see a difference in execution time between float and char.
