Turing Arch - INT4 ops with tensor cores

Hi guys, is there currently any way to perform INT4 ops with Turing tensor cores? cuBLAS only allows float16 and float32, according to https://docs.nvidia.com/cuda/cublas/index.html#cublassetmathmode

The cuDNN docs say INT8 data types are available, but only on sm_72, which is Xavier rather than Turing (Turing is sm_75): https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-speedup-tips

Is a new API coming out soon or something like that? Cheers.

You need to use the experimental sub-byte WMMA features to perform INT4 tensor core operations; see: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma-subbyte
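
In case a concrete starting point helps, here's a minimal sketch of one warp doing a signed-INT4 8x8x32 tile multiply-accumulate through that experimental API. It assumes CUDA 10.0+ compiled with `-arch=sm_75`; the kernel name and the packed test data are just mine, not anything from the docs:

```cpp
// Minimal INT4 WMMA sketch (assumes CUDA 10.0+, nvcc -arch=sm_75).
#include <cstdio>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;
using namespace nvcuda::wmma::experimental;  // precision::s4 lives here

// One warp computes D = A * B + C on an 8x8x32 tile of signed 4-bit inputs.
// A and B are packed 8 values per int32; the accumulator is plain int32.
__global__ void int4_wmma_tile(const int *a, const int *b, int *d) {
    wmma::fragment<wmma::matrix_a, 8, 8, 32, precision::s4, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 8, 8, 32, precision::s4, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 8, 8, 32, int> c_frag;

    wmma::fill_fragment(c_frag, 0);
    // ldm is counted in (4-bit) elements: 32 per row of A / per column of B.
    wmma::load_matrix_sync(a_frag, a, 32);
    wmma::load_matrix_sync(b_frag, b, 32);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(d, c_frag, 8, wmma::mem_row_major);
}

int main() {
    // 8x32 A tile and 32x8 B tile, packed 8 int4 values per int32:
    // 8*32/8 = 32 ints each. Output is an 8x8 int32 tile.
    int *a, *b, *d;
    cudaMallocManaged(&a, 32 * sizeof(int));
    cudaMallocManaged(&b, 32 * sizeof(int));
    cudaMallocManaged(&d, 64 * sizeof(int));
    for (int i = 0; i < 32; ++i) {
        a[i] = 0x11111111;  // every 4-bit element = 1
        b[i] = 0x11111111;
    }
    int4_wmma_tile<<<1, 32>>>(a, b, d);  // one warp
    cudaDeviceSynchronize();
    printf("d[0] = %d (expect 32: dot product of 32 ones)\n", d[0]);
    cudaFree(a); cudaFree(b); cudaFree(d);
    return 0;
}
```

A couple of constraints worth knowing from the programming guide: the sub-byte shape is fixed at 8x8x32 for s4/u4 (8x8x128 for b1), matrix_a must be row-major and matrix_b col-major, and everything in nvcuda::wmma::experimental is explicitly subject to change in future CUDA releases.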