How can I get the desired latency when running on 8-bit data? I have tried the CUDA GEMM functions, and there is no difference in latency between single-precision gemm and gemmEx using 8-bit flags. I also tried the dot-product intrinsic “dp4a”, with no difference.
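For readers unfamiliar with dp4a, here is a minimal host-side sketch of what the signed form of the intrinsic computes: each 32-bit operand is treated as four packed 8-bit integers, the four products are summed, and the result is added to a 32-bit accumulator. The names `dp4a_ref`, `pack4`, and `dp4a_example` are hypothetical helpers made up for this illustration, not CUDA APIs. This is plain C++ that runs on the CPU, so it shows the arithmetic only and says nothing about GPU performance.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: pack four signed 8-bit values into one 32-bit word,
// with x0 in the lowest byte.
int32_t pack4(int8_t x0, int8_t x1, int8_t x2, int8_t x3) {
    uint32_t u = static_cast<uint32_t>(static_cast<uint8_t>(x0))
               | static_cast<uint32_t>(static_cast<uint8_t>(x1)) << 8
               | static_cast<uint32_t>(static_cast<uint8_t>(x2)) << 16
               | static_cast<uint32_t>(static_cast<uint8_t>(x3)) << 24;
    return static_cast<int32_t>(u);
}

// Hypothetical host-side reference for the signed dp4a operation:
// c + a0*b0 + a1*b1 + a2*b2 + a3*b3, where ai and bi are the packed bytes.
int32_t dp4a_ref(int32_t a, int32_t b, int32_t c) {
    for (int i = 0; i < 4; ++i) {
        // Extract byte i of each operand and reinterpret it as signed 8-bit.
        int8_t ai = static_cast<int8_t>((static_cast<uint32_t>(a) >> (8 * i)) & 0xFFu);
        int8_t bi = static_cast<int8_t>((static_cast<uint32_t>(b) >> (8 * i)) & 0xFFu);
        c += static_cast<int32_t>(ai) * static_cast<int32_t>(bi);
    }
    return c;
}

// Small usage example: 1*5 + 2*6 + 3*7 + 4*8 = 70, plus an accumulator of 10.
void dp4a_example() {
    assert(dp4a_ref(pack4(1, 2, 3, 4), pack4(5, 6, 7, 8), 10) == 80);
}
```

One dp4a call replaces four multiplies and four adds, so any benefit shows up as more work per instruction. A kernel that is limited by memory traffic or launch overhead may show little change in measured time.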
That sure looks like throughput to me. I think that might actually be the definition of throughput.
I can certainly demonstrate an increase in throughput for 8-bit int gemm compared to single-precision gemm, in the right setting/setup. So I'm not sure what it is you are asking.
Perhaps you should provide the code you used to make the measurement in both cases.
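When comparing the two cases, it can help to convert each timing into a throughput figure so that the numbers are comparable across problem sizes. A minimal sketch, assuming the usual count of 2·m·n·k operations for a GEMM (one multiply and one add per inner-product term) and a total elapsed time measured over several back-to-back calls. The names `gemm_ops` and `gemm_tops` are hypothetical helpers for this illustration, and the timing itself is assumed to come from whatever measurement method you already use.

```cpp
#include <cassert>
#include <cstdint>

// Operation count for C = A * B with A of size m x k and B of size k x n:
// one multiply and one add for each of the m * n * k inner-product terms.
double gemm_ops(int64_t m, int64_t n, int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) * static_cast<double>(k);
}

// Throughput in tera-operations per second, given the total elapsed time in
// milliseconds for `iters` identical GEMM calls.
double gemm_tops(int64_t m, int64_t n, int64_t k, double elapsed_ms, int iters) {
    assert(elapsed_ms > 0.0 && iters > 0);
    double seconds_per_call = (elapsed_ms / 1000.0) / iters;
    return gemm_ops(m, n, k) / seconds_per_call / 1e12;
}
```

Computing this figure for both the single-precision and the 8-bit run, with the same m, n, and k, makes it clear whether the 8-bit path is actually doing more work per second or whether both runs are dominated by something other than the GEMM itself.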
I notice you are specifying SgemmEx with CUDA_R_32F for the output type. If I wanted to see a significant speedup of INT8 vs. FP32, I would use GemmEx with: