Does Nvidia Titan x have native FP16 and int8 support?

An AnandTech article that came out in July says the NVIDIA Titan X will have INT8 support. Does anyone know anything about this?

https://blogs.nvidia.com/blog/2016/07/21/titan-x/
Here are its numbers:

  • 11 TFLOPS FP32
  • 44 TOPS INT8 (new deep learning inferencing instruction)
  • 12B transistors
  • 3,584 CUDA cores at 1.53GHz (versus 3,072 cores at 1.08GHz in previous TITAN X)
  • Up to 60% faster performance than previous TITAN X
  • High performance engineering for maximum overclocking
  • 12 GB of GDDR5X memory (480 GB/s)

I saw that announcement. Did anyone get a chance to test the performance of 8-bit MAD on the actual physical card?

I don’t see any built-in data types or math functions for INT8 in the CUDA programming guide that ships with the CUDA Toolkit 8.0 RC. Will they appear in the full release of CUDA Toolkit 8.0? If 8-bit support is present in the NVIDIA Titan X, how do we access it and test its performance?

Since C/C++ have had 8-bit integer data types for a very long time, and CUDA supports short-vector 8-bit integer types such as ‘uchar4’, I am not sure what kind of new data type would be needed.
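To make that concrete, here is a minimal sketch (plain device code, nothing Pascal-specific assumed) of the 8-bit handling CUDA already offers through ‘uchar4’ — the kernel name and shape are just for illustration:

```cuda
// What CUDA has offered all along: short-vector 8-bit types, unpacked
// manually and accumulated in 32-bit. The compiler emits ordinary
// 32-bit multiply-adds here; no new instruction is involved.
__global__ void dot_uchar4(const uchar4 *a, const uchar4 *b, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    uchar4 x = a[i], y = b[i];
    int acc = x.x * y.x + x.y * y.y + x.z * y.z + x.w * y.w;
    atomicAdd(out, acc);  // reduce the per-element dot products
}
```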

If your question is, “Does CUDA 8.0 provide new device function intrinsics to access the new DP2A and DP4A instructions”, I don’t know the answer to that. But I would expect these instructions to be accessible from inline PTX at a minimum. Have you checked the latest PTX specification?
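If the 8.0 headers turn out not to expose an intrinsic, the inline-PTX route would look roughly like this. The mnemonic here assumes the PTX ISA spells the opcode ‘dp4a.s32.s32’ on compute capability 6.1 — verify that against the PTX specification shipped with CUDA 8.0 before relying on it:

```cuda
// Hypothetical wrapper around the 4-way byte dot-product-accumulate
// instruction. The opcode spelling "dp4a.s32.s32" is an assumption;
// check the sm_61 section of the PTX ISA document.
__device__ __forceinline__ int dp4a_s32(int a, int b, int c)
{
    int d;
    asm("dp4a.s32.s32 %0, %1, %2, %3;" : "=r"(d) : "r"(a), "r"(b), "r"(c));
    return d;  // d = c + sum of the four signed-byte products of a and b
}
```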

Int8 support, meaning 4 parallel byte multiply-accumulates, is supported by all Kepler, Maxwell, and Pascal NVIDIA cards (sm 3.0 and later). It’s performed in CUDA PTX by the vmad instruction.

fp16x2, meaning 2 parallel 16-bit IEEE floating-point fused multiply-accumulates, is supported by the P100 and also, surprisingly, by the Tegra X1, the Maxwell-based ARM SoC.

As Norbert says, DP2A and DP4A are new byte and word dot-product-and-accumulate instructions on compute capability 6.1 devices (GP106, GP104, and GP102), but not on the P100.

I was expecting a device intrinsic for 8-bit, more like __hadd and __hfma for half floats. I haven’t worked with vmad before; it’s a scalar 32-bit MAD operation. Of course, we can pack four 8-bit values and do it, but how different is that from the DP4A instruction for 8-bit MAD? Has anyone tested DP4A on a GTX 1080 or the new Titan X and measured the throughput? Is it 4x?

It’s a scalar instruction, so it performs only 1 MAD. The SIMD video instructions are described in http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#simd-video-instructions; they were implemented in hardware only on Kepler, at 1/4 throughput, so the overall rate was still 1 MAD/cycle.
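If the final toolkit does expose an intrinsic (the sm_61 headers reportedly name it __dp4a — treat that as unverified until you check your install), one way to start answering the throughput question is a correctness check against emulated semantics, then timing both paths. A hedged sketch:

```cuda
// Compares the assumed __dp4a intrinsic (compute capability 6.1) against
// a spelled-out emulation of its defined semantics: treat each int as
// four signed bytes, multiply pairwise, add all four products into the
// 32-bit accumulator.
__device__ int emulate_dp4a(int a, int b, int c)
{
    for (int k = 0; k < 4; ++k)
        c += (int)(signed char)(a >> 8 * k) * (int)(signed char)(b >> 8 * k);
    return c;
}

__global__ void check_dp4a(int a, int b, int c, int *out)
{
#if __CUDA_ARCH__ >= 610
    out[0] = __dp4a(a, b, c);        // assumed intrinsic; verify it exists
#else
    out[0] = emulate_dp4a(a, b, c);  // fallback on pre-6.1 parts
#endif
    out[1] = emulate_dp4a(a, b, c);  // reference result for comparison
}
```

If the two outputs agree, timing a long dependent chain of the intrinsic against four unpacked scalar MADs would show directly whether the throughput is really 4x.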