Performance when using 8-bit operations

Hi all,

How can I get the desired latency when running on 8-bit data? I have tried the CUDA gemm functions, and there is no difference in latency between single-precision gemm and gemmEx using the 8-bit flags. I also tried the dot-product intrinsic “dp4a”, again with no difference.

kalyan ch
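For readers unfamiliar with the intrinsic mentioned above: the signed form of `__dp4a` treats each 32-bit operand as four packed signed 8-bit values, multiplies them pairwise, and adds the four products to a 32-bit accumulator. The following is only a CPU-side sketch of that arithmetic in plain C++; `dp4a_reference` and `pack4` are hypothetical names introduced for illustration, not part of CUDA.

```cpp
#include <cstdint>

// CPU reference for the arithmetic of the signed __dp4a intrinsic:
// a and b each hold four packed signed 8-bit values; the four pairwise
// products are summed and added to the 32-bit accumulator c.
// (dp4a_reference is a hypothetical name, not a CUDA function.)
std::int32_t dp4a_reference(std::uint32_t a, std::uint32_t b, std::int32_t c) {
    std::int32_t acc = c;
    for (int i = 0; i < 4; ++i) {
        // Extract byte i of each operand and reinterpret it as signed.
        const auto ai = static_cast<std::int8_t>((a >> (8 * i)) & 0xFFu);
        const auto bi = static_cast<std::int8_t>((b >> (8 * i)) & 0xFFu);
        acc += static_cast<std::int32_t>(ai) * static_cast<std::int32_t>(bi);
    }
    return acc;
}

// Hypothetical helper: pack four signed bytes, x0 in the lowest byte.
std::uint32_t pack4(std::int8_t x0, std::int8_t x1, std::int8_t x2, std::int8_t x3) {
    return  static_cast<std::uint32_t>(static_cast<std::uint8_t>(x0))
         | (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x1)) << 8)
         | (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x2)) << 16)
         | (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x3)) << 24);
}
```

For example, `dp4a_reference(pack4(1, 2, 3, 4), pack4(5, 6, 7, 8), 10)` evaluates 1*5 + 2*6 + 3*7 + 4*8 + 10. One such instruction performs four multiply-adds, so any gain shows up as throughput rather than as lower latency per operation.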

GPUs are generally not optimized for latency, and 8-bit data accesses will suffer the same latency as 32-bit accesses. Did you mean “throughput”?

No, not throughput. I mean in terms of the number of operations per unit time.

That sure looks like throughput to me. I think that might actually be the definition of throughput.

I can certainly demonstrate an increase in throughput for 8 bit int gemm compared to single precision gemm, in the right setting/setup. So not sure what it is you are asking.

Perhaps you should provide the code you used to make the measurement in both cases.
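To make "operations per unit time" concrete for a GEMM, a common convention is to count one multiply and one add per inner-product term, i.e. 2*m*n*k operations, and divide by the measured time. A minimal sketch under that assumption; `gemm_ops_per_second` is a hypothetical helper name, and the elapsed time has to come from your own measurement.

```cpp
#include <cstdint>

// Hypothetical helper: operations per second for an (m x k) * (k x n) GEMM,
// counting one multiply and one add per inner-product term (2*m*n*k total).
double gemm_ops_per_second(std::int64_t m, std::int64_t n, std::int64_t k,
                           double elapsed_seconds) {
    const double ops = 2.0 * static_cast<double>(m) *
                       static_cast<double>(n) * static_cast<double>(k);
    return ops / elapsed_seconds;
}
```

Calling it as `gemm_ops_per_second(4096, 4096, 4096, t)` with the time `t` measured for the FP32 call and again for the INT8 call gives two figures that can be compared directly.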


I am checking these two lines of code:

cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,4096, 4096, 4096, alpha, A1, 4096, B1, 4096, beta, D, 4096);


and they take almost the same time.

OK, sorry. Then yes, I am looking for throughput.

I notice you are specifying SgemmEx with CUDA_R_32F for the output type. If I wanted to see a significant speed up of INT8 vs. FP32 I would use GemmEx with:

computeType    Atype/Btype    Ctype
CUDA_R_32I     CUDA_R_8I      CUDA_R_32I
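A rough sketch of what a call with those types could look like, timed with CUDA events. This is not the original poster's code: the all-ones inputs, the warm-up call, and the choice of `CUBLAS_GEMM_DEFAULT` are assumptions made for illustration. Newer toolkits spell the compute type `CUBLAS_COMPUTE_32I`, so the sketch selects the spelling with a version check. It needs a GPU and cuBLAS version that support INT8 GEMM; check the cuBLAS documentation for the alignment requirements on 8-bit inputs.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4096;  // square size taken from the cublasSgemm call in the thread
    const size_t count = static_cast<size_t>(n) * n;

    // All-ones inputs (an assumption for easy checking): each C element should be n.
    std::vector<std::int8_t> hA(count, 1), hB(count, 1);
    std::int8_t *dA = nullptr, *dB = nullptr;
    std::int32_t *dC = nullptr;
    if (cudaMalloc(&dA, count) != cudaSuccess ||
        cudaMalloc(&dB, count) != cudaSuccess ||
        cudaMalloc(&dC, count * sizeof(std::int32_t)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dA, hA.data(), count, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), count, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        std::fprintf(stderr, "cublasCreate failed\n");
        return 1;
    }

    // With 32-bit integer compute, alpha and beta are 32-bit integers.
    const std::int32_t alpha = 1, beta = 0;
#if defined(CUBLAS_VER_MAJOR) && CUBLAS_VER_MAJOR >= 11
    const cublasComputeType_t compute = CUBLAS_COMPUTE_32I;  // newer toolkits
#else
    const cudaDataType compute = CUDA_R_32I;                 // as written in the thread
#endif

    // INT8 inputs, INT32 output, INT32 accumulation, column-major, no transposes.
    auto gemm = [&]() {
        return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                            &alpha, dA, CUDA_R_8I, n,
                                    dB, CUDA_R_8I, n,
                            &beta,  dC, CUDA_R_32I, n,
                            compute, CUBLAS_GEMM_DEFAULT);
    };

    // Warm-up call so one-time setup cost is not included in the timing.
    cublasStatus_t status = gemm();
    if (status != CUBLAS_STATUS_SUCCESS) {
        std::fprintf(stderr, "cublasGemmEx failed, status %d\n",
                     static_cast<int>(status));
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    status = gemm();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for the GEMM to finish before reading the time
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Spot-check one output element against the value implied by all-ones inputs.
    std::int32_t c0 = 0;
    cudaMemcpy(&c0, dC, sizeof(c0), cudaMemcpyDeviceToHost);
    std::printf("status=%d  C[0]=%d (all-ones inputs imply %d)  time=%.3f ms\n",
                static_cast<int>(status), static_cast<int>(c0), n, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}
```

Something like `nvcc gemm_int8.cu -lcublas` should build it (the file name is made up). Timing the same problem size with `cublasSgemm` in the same way gives the FP32 figure to compare against.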

Beyond that, I wouldn’t be able to help you unless you provide complete codes. It’s also important to know the platform and GPU you are running on.

If you want to provide all that information, I’ll take a look as time permits.

Thanks Robert_Crovella, with GemmEx I have seen a difference in execution time for float and char.