Tensor WMMA INT8 vs FP16 processing speed

cudapop1 · February 14, 2019, 6:41pm

Hi all,

I recently got an RTX card and wanted to test out the speed when using the new INT8 mode of the Turing tensor cores vs. the regular FP16 mode.

I used the sample code from the “Programming Tensor Cores in CUDA 9” developer blog (code-samples/posts/tensor-cores at master · NVIDIA-developer-blog/code-samples · GitHub) and modified it slightly so I had one kernel that did FP16 WMMA and another kernel that did INT8 WMMA.

__global__ void wmma_example_f16(half *a, half *b)
{
   // Tile using a 2D grid
   int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
   int warpN = (blockIdx.y * blockDim.y + threadIdx.y);

   // Declare the fragments
   wmma::fragment<wmma::matrix_a, WmmaDim, WmmaDim, WmmaDim, half, wmma::col_major> a_frag;
   wmma::fragment<wmma::matrix_b, WmmaDim, WmmaDim, WmmaDim, half, wmma::col_major> b_frag;
   wmma::fragment<wmma::accumulator, WmmaDim, WmmaDim, WmmaDim, float> acc_frag;

   wmma::fill_fragment(acc_frag, 0);

   // Loop over k
   for (int i = 0; i < MatDim; i += WmmaDim) {
      int aRow = warpM * WmmaDim;
      int aCol = i;

      int bRow = i;
      int bCol = warpN * WmmaDim;

      // Bounds checking
      if (aRow < MatDim && aCol < MatDim && bRow < MatDim && bCol < MatDim) {
         // Load the inputs
         wmma::load_matrix_sync(a_frag, a + aRow + aCol * MatDim, MatDim);
         wmma::load_matrix_sync(b_frag, b + bRow + bCol * MatDim, MatDim);
 
         // Perform the matrix multiplication
         wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
      }
   }
} // wmma_example_f16

__global__ void wmma_example_i8(signed char *a, signed char *b)
{
   // Tile using a 2D grid
   int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
   int warpN = (blockIdx.y * blockDim.y + threadIdx.y);

   // Declare the fragments
   wmma::fragment<wmma::matrix_a, WmmaDim, WmmaDim, WmmaDim, signed char, wmma::col_major> a_frag;
   wmma::fragment<wmma::matrix_b, WmmaDim, WmmaDim, WmmaDim, signed char, wmma::col_major> b_frag;
   wmma::fragment<wmma::accumulator, WmmaDim, WmmaDim, WmmaDim, int> acc_frag;

   wmma::fill_fragment(acc_frag, 0);

   // Loop over k
   for (int i = 0; i < MatDim; i += WmmaDim) {
      int aRow = warpM * WmmaDim;
      int aCol = i;

      int bRow = i;
      int bCol = warpN * WmmaDim;

      // Bounds checking
      if (aRow < MatDim && aCol < MatDim && bRow < MatDim && bCol < MatDim) {
         // Load the inputs
         wmma::load_matrix_sync(a_frag, a + aRow + aCol * MatDim, MatDim);
         wmma::load_matrix_sync(b_frag, b + bRow + bCol * MatDim, MatDim);
 
         // Perform the matrix multiplication
         wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
      }
   }
} // wmma_example_i8

In both kernels, I’ve removed the code which stores the result fragment back out to global memory just for brevity.

I verified that the compiler does compile the “loop over K” for-loops by generating PTX files and checking that the PTX contains the “wmma.load” and “wmma.mma.sync” commands. I’ve also verified in the PTX that the INT8 version is using the “wmma.mma.sync.aligned.col.col.m16n16k16.s32.s8.s8.s32” command, while the FP16 version is using the “wmma.mma.sync.aligned.col.col.m16n16k16.f32.f32” command.

The weird thing is that both kernels show almost the same execution time (timed via CUDA events). For example: using 2048x2048 matrices, they both show around 0.11 ms execution times (on an RTX 2060) regardless of it being the INT8 kernel or FP16 kernel being run.

Since INT8 mode is supposed to have double the throughput of FP16 mode, I was expecting the INT8 kernel to execute much faster than the FP16 kernel.

Anyone have any ideas why they both show almost the same execution times?

(I’ve also verified that the compiler is not “optimizing away” the for-K-loop, even without any write out to global memory in the kernel code, because the execution times change depending on the sizes of the arrays)

cudapop1 · February 14, 2019, 6:46pm

Forgot to add: in the code above, MatDim is defined as 2048, and WmmaDim is defined as 16.

As a further test, in order to isolate the matmul operation from the memory read, I also tried removing the 2 lines which contain the “wmma::load_matrix_sync” commands. This way the for-loop is just executing the “wmma::mma_sync” command.

With that change, I verified in the resulting PTX files that the loop still contains the "“wmma.mma.sync.aligned” PTX instructions. I also verified that the execution times still varies according to the matrix size (ie. value of MatDim).

But even with this change, both the INT8 kernel and FP16 kernel still execute in the same amount of time. I’m still not getting the “double throughput” I was expecting from the INT8 mode.

Any ideas?

Topic		Replies	Views
cuBLAS INT8 tensor core mode vs. FP16 mode GPU-Accelerated Libraries cublas	13	5967	December 5, 2022
cuBLAS INT8 tensor core mode vs. FP16 mode GPU-Accelerated Libraries	0	924	February 15, 2019
Why is' int8 'not as fast as' fp16' TensorRT tensorrt	1	633	February 1, 2021
Same inference speed for INT8 and FP16 TensorRT	10	6238	October 12, 2021
Int8 is not faster than fp16 on xavier Jetson AGX Xavier tensorrt	5	851	October 18, 2021
TRT Engin in INT8 is much slower than FP16 TensorRT	4	2093	November 11, 2021
Int8 is 30% slower than fp16 in cudnn_samples_v8/conv_sample cuDNN	4	854	February 8, 2023
Little performance difference between int8 and fp16 on RTX2080 TensorRT	4	2736	July 5, 2021
Estimating convolution performance with mixed precision GPU-Accelerated Libraries	0	525	August 28, 2018
Error while using different precisions with tensor cores Deep Learning (Training & Inference) mixed-precision	2	2069	July 26, 2019

Tensor WMMA INT8 vs FP16 processing speed

Related topics