As far as I know, the tensor cores in the GV100 chip only support floating-point operations (HMMA). The following article states that the GV10B in the Jetson AGX Xavier also supports (U)INT8 (IMMA): https://developer.nvidia.com/blog/nvidia-jetson-agx-xavier-32-teraops-ai-robotics/
So the GV10B tensor cores are a bit like Turing tensor cores. That led me to the following question:
CUDA Toolkit 10.2, as shipped in the current JetPack release, supports BMMA experimentally. Can the Jetson AGX Xavier perform these operations natively? Sadly, the CUDA docs are a bit thin on compute capability 7.2 and make no clear statement about this; only 7.0 and 7.5 are mentioned explicitly as far as I can tell. If I missed something, I am happy to stand corrected :)
The background is that I want to accelerate a Hamming-distance computation between two bit matrices of shape AxN and NxB, where each output element accumulates popcount(a_n XOR b_n) over n. That is exactly what the BMMA operation computes, as far as I understand.
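For reference, here is a minimal scalar sketch (plain C, names and sizes are my own choice) of the product I am after. Rows are bit-packed into 64-bit words, and each output element is the XOR + popcount accumulation that BMMA would perform in hardware per tile:

```c
#include <stdint.h>

/* Scalar reference: D[i][j] = sum_w popcount(A[i][w] XOR B[j][w]).
 * A holds ROWS bit-rows, B holds COLS bit-rows (i.e. B is stored
 * transposed), each packed into WORDS 64-bit words. The 4x4 output
 * and K = 128 bits are arbitrary example sizes, not BMMA tile sizes.
 * __builtin_popcountll is a GCC/Clang builtin. */
#define ROWS 4
#define COLS 4
#define WORDS 2 /* K = 64 * WORDS = 128 bits per row */

static void hamming_mm(const uint64_t A[ROWS][WORDS],
                       const uint64_t B[COLS][WORDS],
                       int D[ROWS][COLS])
{
    for (int i = 0; i < ROWS; ++i) {
        for (int j = 0; j < COLS; ++j) {
            int acc = 0;
            for (int w = 0; w < WORDS; ++w)
                acc += __builtin_popcountll(A[i][w] ^ B[j][w]);
            D[i][j] = acc;
        }
    }
}
```

If BMMA works natively on the Xavier, my hope is that this inner XOR + popcount loop maps directly onto the tensor cores.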