INT8 cublasGemmEx support on Tegra X2 and Tesla P100

Hello,

I am unsuccessfully trying to run INT8 matrix-matrix multiplication on the following Pascal devices: Tegra X2 and Tesla P100.

I am using the generic cublasGemmEx interface. According to the documentation, any GPU with CC > 5.0 should support this method, but I get:

  • CUBLAS_STATUS_ARCH_MISMATCH for P100 (although CC=6.0)
  • CUBLAS_STATUS_NOT_SUPPORTED for TX2 (although CC=6.2)

The same code with FP16 matrices works like a charm.
The exact same code runs perfectly on other Pascal cards (1070, TitanXP).
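For reference, here is a stripped-down sketch of the kind of call I mean (illustrative only, not my actual code: sizes are arbitrary multiples of 4 per the documented INT8 restrictions, and error checking on the allocations is omitted):

  #include <cstdio>
  #include <cstdint>
  #include <cuda_runtime.h>
  #include <cublas_v2.h>

  int main()
  {
      const int m = 128, n = 128, k = 128;   // multiples of 4, as required for the INT8 path

      int8_t  *dA, *dB;
      int32_t *dC;
      cudaMalloc(&dA, m * k * sizeof(int8_t));
      cudaMalloc(&dB, k * n * sizeof(int8_t));
      cudaMalloc(&dC, m * n * sizeof(int32_t));

      cublasHandle_t handle;
      cublasCreate(&handle);

      const int32_t alpha = 1, beta = 0;     // int32 scalars to match the INT32 compute type

      // INT8 inputs, INT32 accumulation and output, column-major, no transpose
      cublasStatus_t st = cublasGemmEx(handle,
                                       CUBLAS_OP_N, CUBLAS_OP_N,
                                       m, n, k,
                                       &alpha,
                                       dA, CUDA_R_8I, m,
                                       dB, CUDA_R_8I, k,
                                       &beta,
                                       dC, CUDA_R_32I, m,
                                       CUDA_R_32I,
                                       CUBLAS_GEMM_DFALT);   // default algorithm in CUDA 8

      printf("cublasGemmEx status: %d\n", (int)st);          // nonzero = failure (e.g. ARCH_MISMATCH / NOT_SUPPORTED)

      cublasDestroy(handle);
      cudaFree(dA); cudaFree(dB); cudaFree(dC);
      return 0;
  }

Built with nvcc and linked against -lcublas, this style of call is what works for me on the 1070/TitanXP but fails with the status codes above on the P100 and TX2.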

Does anyone know what the actual support for these operations is on these cards?

Any help appreciated. Thank you in advance.

What CUDA version are you using? Does CC 6.0 actually provide the specific INT8 operations needed? I thought not:

https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/

Thank you for the link.

I am using CUDA 8.0.

Do you have any idea about INT8 support on the Tegra X2?
And what about the GV100? I found it surprisingly difficult to find clear communication from NVIDIA about this subject.

I am not familiar with the Tegra TX2. You could try asking in the dedicated Tegra X2 forum “next door”. I likewise know close to nothing about GV100 other than that it exists. GV100 is a supercomputer class part, not anything I am likely to use any time soon.

At this time, INT8 support is available on cc6.1 and cc7.0 compute capability devices.

GV100 = cc7.0 (INT8 is supported)
TX2 = cc6.2 (INT8 not supported)

A simple Google search on “INT8 TX2” turns up the TX2 information readily.
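If you want to guard the INT8 path at run time, one option is to check the device's compute capability against that list before calling into cuBLAS. A rough sketch (the whitelist just mirrors the cc6.1/cc7.0 list above; treating everything at cc7.0 and beyond as supported is an assumption about later parts):

  #include <cstdio>
  #include <cuda_runtime.h>

  // Rough check: INT8 GEMM is available on cc6.1 and cc7.0 devices per the list above.
  // Accepting anything >= cc7.0 is an assumption intended to cover later architectures.
  static bool int8GemmSupported(int device)
  {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, device);
      return (prop.major == 6 && prop.minor == 1) || (prop.major >= 7);
  }

  int main()
  {
      int device = 0;
      cudaGetDevice(&device);
      printf("INT8 GEMM path: %s\n",
             int8GemmSupported(device) ? "supported" : "not supported (fall back to FP16/FP32)");
      return 0;
  }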

Note that when using code optimized for the Tensor Cores (on GV100), the peak theoretical FP16 throughput is higher than the peak theoretical INT8 throughput.

Peak theoretical INT8 throughput on GV100 is ~4x the FP32 throughput (so 4 x 15 TOPS = 60 TOPS), whereas the peak theoretical FP16 multiply/accumulate throughput for matrix-multiply ops on the Tensor Cores is 120 TFLOPS.