INT8 cublasGemmEx support on Tegra X2 and Tesla P100

Hello,

I am unsuccessfully trying to run INT8 matrix-matrix multiplication on the following Pascal devices: Tegra X2 and Tesla P100.

I am using the generic cublasGemmEx interface. According to the documentation, any GPU with CC > 5.0 should support this method, but I get:

  • CUBLAS_STATUS_ARCH_MISMATCH for P100 (although CC=6.0)
  • CUBLAS_STATUS_NOT_SUPPORTED for TX2 (although CC=6.2)

The same code with FP16 matrices works like a charm.
The exact same code runs perfectly on other Pascal cards (1070, TitanXP).
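For reference, here is a stripped-down sketch of the kind of call I mean (illustrative only, not my actual code: sizes are arbitrary multiples of 4 per the documented INT8 restrictions, and error checking on the allocations is omitted):

  #include <cstdio>
  #include <cstdint>
  #include <cuda_runtime.h>
  #include <cublas_v2.h>

  int main()
  {
      const int m = 128, n = 128, k = 128;   // multiples of 4, as required for the INT8 path

      int8_t  *dA, *dB;
      int32_t *dC;
      cudaMalloc(&dA, m * k * sizeof(int8_t));
      cudaMalloc(&dB, k * n * sizeof(int8_t));
      cudaMalloc(&dC, m * n * sizeof(int32_t));

      cublasHandle_t handle;
      cublasCreate(&handle);

      const int32_t alpha = 1, beta = 0;     // int32 scalars to match the INT32 compute type

      // INT8 inputs, INT32 accumulation and output, column-major, no transpose
      cublasStatus_t st = cublasGemmEx(handle,
                                       CUBLAS_OP_N, CUBLAS_OP_N,
                                       m, n, k,
                                       &alpha,
                                       dA, CUDA_R_8I, m,
                                       dB, CUDA_R_8I, k,
                                       &beta,
                                       dC, CUDA_R_32I, m,
                                       CUDA_R_32I,
                                       CUBLAS_GEMM_DFALT);   // default algorithm in CUDA 8

      printf("cublasGemmEx status: %d\n", (int)st);          // nonzero = failure (e.g. ARCH_MISMATCH / NOT_SUPPORTED)

      cublasDestroy(handle);
      cudaFree(dA); cudaFree(dB); cudaFree(dC);
      return 0;
  }

Built with nvcc and linked against -lcublas, this style of call is what works for me on the 1070/TitanXP but fails with the status codes above on the P100 and TX2.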

Does anyone know what the actual support for these operations is on these cards?

Any help appreciated. Thank you in advance.

What CUDA version are you using? Does CC 6.0 actually provide the specific INT8 operations needed? I thought not:

https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/

Thank you for the link.

I am using CUDA 8.0.

Do you have any idea about INT8 support on the Tegra X2?
And what about the GV100? I found it surprisingly difficult to find clear communication from NVIDIA about this subject.

I am not familiar with the Tegra TX2. You could try asking in the dedicated Tegra X2 forum “next door”. I likewise know close to nothing about GV100 other than that it exists. GV100 is a supercomputer class part, not anything I am likely to use any time soon.

At this time, INT8 support is available on cc6.1 and cc7.0 compute capability devices.

GV100 = cc7.0 (INT8 is supported)
TX2 = cc6.2 (INT8 not supported)

A simple Google search on “INT8 TX2” turns up the TX2 information readily.
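If you want to guard the INT8 path at run time, one option is to check the device's compute capability against that list before calling into cuBLAS. A rough sketch (the whitelist just mirrors the cc6.1/cc7.0 list above; treating everything at cc7.0 and beyond as supported is an assumption about later parts):

  #include <cstdio>
  #include <cuda_runtime.h>

  // Rough check: INT8 GEMM is available on cc6.1 and cc7.0 devices per the list above.
  // Accepting anything >= cc7.0 is an assumption intended to cover later architectures.
  static bool int8GemmSupported(int device)
  {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, device);
      return (prop.major == 6 && prop.minor == 1) || (prop.major >= 7);
  }

  int main()
  {
      int device = 0;
      cudaGetDevice(&device);
      printf("INT8 GEMM path: %s\n",
             int8GemmSupported(device) ? "supported" : "not supported (fall back to FP16/FP32)");
      return 0;
  }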

Note that when using code optimized for the Tensor Cores (on GV100), the peak theoretical FP16 throughput is higher than the peak theoretical INT8 throughput.

Peak theoretical INT8 throughput on GV100 is ~4x the FP32 throughput (so 4 x 15 TOPS = 60 TOPS), whereas the peak theoretical FP16 multiply/accumulate throughput for matrix-multiply ops on the Tensor Cores is 120 TFLOPS.