xcq_88
February 27, 2019, 3:32am
1
Hi, I used cublasGemmEx to test INT8 performance with the CUBLAS_GEMM_DEFAULT_TENSOR_OP algo. I also called cublasSetMathMode with CUBLAS_TENSOR_OP_MATH.
1. matrix A: (8192x8192), CUBLAS_OP_N, CUDA_R_8I.
2. matrix B: (8192x8192), CUBLAS_OP_T, CUDA_R_8I.
3. matrix C: (8192x8192), CUDA_R_32I.
4. compute type: CUDA_R_32I.
5. alpha, beta: (1.0, 0.0).
6. algo: CUBLAS_GEMM_DEFAULT
But the program only reached 5.4 TOPS, not the 22 TOPS in the NVIDIA Xavier specs.
I suspect cublasGemmEx (INT8) does not use the Tensor Cores. The 5.4 TOPS INT8 figure looks like about 4x the INT32 rate (1.3 TOPS measured, 1.5 TOPS according to the specs: 1.377 GHz × 512 cores × 2 ops ≈ 1.4 TOPS).
So how to test int8 performance on tensor core?
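For reference, a minimal sketch of the call pattern described above (data initialization and error handling omitted; leading dimensions assumed equal to n). Note that with CUDA_R_32I compute, alpha and beta are passed as int32 in this pre-CUDA-11 cuBLAS API, not as floats:

```cpp
// Sketch of the INT8 GEMM configuration from the post, using the
// cuBLAS 10.x API where computeType is a cudaDataType_t.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void int8_gemm(int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    int8_t *dA, *dB;
    int32_t *dC;
    cudaMalloc(&dA, (size_t)n * n * sizeof(int8_t));
    cudaMalloc(&dB, (size_t)n * n * sizeof(int8_t));
    cudaMalloc(&dC, (size_t)n * n * sizeof(int32_t));

    // With CUDA_R_32I compute, alpha/beta must be int32.
    int32_t alpha = 1, beta = 0;
    cublasGemmEx(handle,
                 CUBLAS_OP_N, CUBLAS_OP_T,   // A non-transposed, B transposed
                 n, n, n,
                 &alpha,
                 dA, CUDA_R_8I, n,
                 dB, CUDA_R_8I, n,
                 &beta,
                 dC, CUDA_R_32I, n,
                 CUDA_R_32I,                 // INT32 accumulate
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
}
```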
Hi,
Have you maximized the CPU/GPU performance first?
sudo ./jetson_clocks.sh
To have further investigation, could you share the source file with us?
We want to reproduce this issue in our environment.
Thanks.
xcq_88
February 28, 2019, 3:04am
3
AastaLLL:
Hi,
Have you maximized the CPU/GPU performance first?
sudo ./jetson_clocks.sh
To have further investigation, could you share the source file with us?
We want to reproduce this issue in our environment.
Thanks.
I already set the power mode and maximized the clocks:
$ sudo nvpmodel -m 0
$ sudo nvpmodel -q --verbose
$ sudo ./jetson_clocks.sh
$ sudo ./jetson_clocks.sh --show
I want to know which program you used to measure the 22 TOPS INT8 figure. Was it not GEMM?
FP16 reaches 8.1 TFLOPS of the 11 TFLOPS peak with cublasHgemm, which I think is reasonable.
But the INT8 result of 5.4 TOPS is far from the 22 TOPS in the docs.
Hi,
Sorry for the late update.
The value is measured with INT8 GEMM.
We will check this with our internal team to see if any extra setting is applied.
Thanks.
xcq_88
April 11, 2019, 6:46am
5
Using cudnnConvolutionForward with the CUDNN_TENSOR_NCHW_VECT_C format, I can reach 22 TOPS.
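A hedged sketch of the descriptor setup that selects cuDNN's vectorized INT8 path via NCHW_VECT_C (channels packed by 4 as INT8x4); the sizes are illustrative, error checks and buffer allocation are omitted, and c must be a multiple of 4:

```cpp
#include <cudnn.h>

void setup_int8_conv(int n, int c, int h, int w, int k, int r, int s) {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    // Input and filter in the vectorized INT8 layout.
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x4, n, c, h, w);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_INT8x4,
                               CUDNN_TENSOR_NCHW_VECT_C, k, c, r, s);
    // Accumulate in INT32; no padding, unit stride/dilation here.
    cudnnSetConvolution2dDescriptor(convDesc, 0, 0, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION,
                                    CUDNN_DATA_INT32);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x4, n, k, h - r + 1, w - s + 1);
    // ...then allocate device buffers, choose a forward algorithm,
    // and call cudnnConvolutionForward(...) with these descriptors.
}
```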
Hi,
Thanks for your update
We can close this topic now.
Hi, xcq_88,
Could you share the code that reached 22TOPS?
Thanks a lot!
It would be great if you could share the code, thank you!