Turing Tensor core int4 operation

Can TensorRT support 4-bit integer quantization? I cannot find any example of it in the TensorRT sample source code.
If not, how can we implement it ourselves?
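In the meantime, the arithmetic itself can be prototyped in software. Below is a minimal sketch of symmetric linear quantization to a signed n-bit range (here 4-bit, i.e. [-8, 7]). The function names and the simple max-magnitude scale are illustrative assumptions, not TensorRT's actual calibration algorithm:

```python
import numpy as np

def quantize(x, scale, bits=4):
    # Symmetric signed range for the given bit width: int4 -> [-8, 7].
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), qmin, qmax).astype(np.int8)

def dequantize(q, scale):
    # Map integer codes back to (approximate) real values.
    return q.astype(np.float32) * scale

# Toy tensor; pick the scale so the largest magnitude maps to qmax.
# (A real calibrator would choose the scale from activation
# statistics rather than a plain max -- this is an assumption.)
x = np.array([0.5, -1.25, 2.0, -2.0], dtype=np.float32)
scale = np.abs(x).max() / 7.0
q = quantize(x, scale, bits=4)
x_hat = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step (scale / 2).
```

The same helper with `bits=8` gives the [-128, 127] range used for INT8 inference; the only difference is the number of representable levels.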


Please refer to the TensorRT support matrix, which lists the TensorRT layers, the hardware, and the precision modes that each layer supports.

Reference: https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html#layers-precision-matrix

According to Table 3 of https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html#layers-precision-matrix:

INT8 is mostly unsupported in TensorRT 5.0.4, apart from a few data-rearrangement layers. But if I compile the sampleINT8API example on GeForce RTX 2070 hardware, inference is about 3x faster compared with FP32, and about 40% faster than FP16.

How can it be faster if it is not supported?


The 2070 has CUDA compute capability 7.5, which, per https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html#hardware-precision-matrix, supports the INT8 precision mode.

For more details on 8-bit inference with TensorRT, please see: