Xavier Tensor Core int8 performance cannot reach 22 TOPS with the cublasGemmEx API?

Hi, I used cublasGemmEx with the CUBLAS_GEMM_DEFAULT_TENSOR_OP algorithm to test int8 performance. I also called cublasSetMathMode with CUBLAS_TENSOR_OP_MATH.

1. Matrix A: 8192x8192, CUBLAS_OP_N, CUDA_R_8I.
2. Matrix B: 8192x8192, CUBLAS_OP_T, CUDA_R_8I.
3. Matrix C: 8192x8192, CUDA_R_32I.
4. Compute type: CUDA_R_32I.
5. alpha, beta: (1.0, 0.0).

But the program only reached 5.4 TOPS, not the 22 TOPS listed in the NVIDIA Xavier specs.

I think cublasGemmEx with int8 may not be using the Tensor Cores. The measured int8 figure of 5.4 TOPS looks like about 4 times the int32 rate (1.3 TOPS; 1.5 TOPS according to the specs; 1.377 GHz * 512 cores * 2 ops = 1.4 TOPS).

So how to test int8 performance on tensor core?


Have you maximized the CPU/GPU performance first?

sudo ./jetson_clocks.sh

To investigate further, could you share the source file with us?
We want to reproduce this issue in our environment.


I already set the power mode and maximized the clocks.

sudo nvpmodel -m 0
sudo nvpmodel -q --verbose

sudo ./jetson_clocks.sh
sudo ./jetson_clocks.sh --show

I would like to know which program you used to measure the int8 performance (22 TOPS). Was it not GEMM?

FP16 reaches 8.1 TFLOPS out of 11 TFLOPS with cublasHgemm, which I think is reasonable.

But the INT8 performance of 5.4 TOPS is far from the 22 TOPS in the docs.


Sorry for the late update.

The value is measured with INT8 GEMM.
We will check this with our internal team to see if any extra setting is applied.


Using the cudnnConvolutionForward function with the CUDNN_TENSOR_NCHW_VECT_C tensor format can reach 22 TOPS.


Thanks for your update.
We can close this topic now.

Hi, xcq_88,

Could you share the code that reached 22TOPS?
Thanks a lot!

It would be great if you could share the code, thank you!