CUDA operations alongside Tensor operations


Is it possible for CUDA core and Tensor Core operations to execute simultaneously, e.g., on an A100? Would that raise the achievable TFLOPS by a small amount?

On the Volta through Ampere architectures, the SM consists of 4 sub-partitions. Each sub-partition has a warp scheduler, a register file, and execution units. The warp scheduler can dispatch 1 instruction per cycle. Tensor (*MMA) instructions are issued in 1 cycle. On the next cycle the warp scheduler can issue instructions to the FMA pipe (FP32, INT32), the ALU pipe (INT, bit manipulation), the XU (transcendentals), the LSU (shared, global, local memory), or the TEX unit. The following additional restrictions apply during the 1-N cycles after a Tensor instruction:

  • On GV100 and GA100, FP64 math instructions cannot be issued.
  • On GV100 and TU10x, FP16x2 instructions cannot be issued.

These cycles are generally used for post-processing or for the address computation and data movement needed to feed the Tensor Cores.
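As a concrete illustration of the scheduling described above, here is a minimal kernel sketch using the CUDA WMMA API (requires sm_70 or newer, FP16 inputs). The warp-level `mma_sync` goes to the Tensor Cores, while the per-thread `fmaf` is ordinary FP32 work that the same warp scheduler can issue to the FMA pipe on the cycles between Tensor instructions. The kernel and buffer names are illustrative, not from the original post, and whether the pipes actually overlap depends on occupancy and instruction mix:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Sketch: one warp performs a 16x16x16 Tensor Core MMA while the same
// threads also issue FP32 FMA-pipe work. Both instruction streams come
// from the same warp scheduler, so FP32 math can fill issue slots
// between Tensor instructions rather than adding to their latency.
__global__ void mixed_pipes(const half *A, const half *B, float *C,
                            const float *x, float *y, int n) {
    // Tensor Core path (issued to the Tensor pipe).
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);

    // FMA-pipe path (ordinary FP32 math from the same warps).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaf(x[i], 2.0f, 1.0f);

    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```

Profiling such a kernel in Nsight Compute (pipe utilization per SM sub-partition) is the practical way to confirm whether the Tensor and FMA pipes are overlapping in a given workload.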
