A100 operations
Is it possible to have Cuda and Tensor core operations simultaneously? Would it raise the amount of TFLOPS by a small portion?
On the Volta through Ampere architectures, the SM consists of 4 sub-partitions. Each sub-partition has a warp scheduler, a register file, and execution units. The warp scheduler can dispatch 1 instruction per cycle. Tensor (*MMA) instructions are issued in 1 cycle. On the next cycle the warp scheduler can issue instructions to the FMA pipe (FP32, INT32), the ALU pipe (INT, bit manipulation), the XU (transcendental), the LSU (SHMEM, global, local), or the TEX unit. The following additional restrictions exist during the 1-N cycles after a Tensor instruction:
These cycles are generally used for post-processing or for address and data movement necessary to feed the Tensor cores.
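As for the second part of the question, a rough back-of-envelope estimate of the potential gain, assuming the published A100 datasheet peaks (19.5 TFLOPS FP32 on the CUDA cores, 312 TFLOPS FP16 Tensor core throughput) and perfect overlap with no scheduler or register-file contention:

```python
# Back-of-envelope: how much could concurrent CUDA-core FP32 work add
# on top of peak Tensor core throughput on an A100?
# Assumptions: published datasheet peaks, perfect dual-issue overlap,
# no contention for the shared warp scheduler or register file.

FP32_TFLOPS = 19.5           # CUDA cores, FP32 FMA peak
TENSOR_FP16_TFLOPS = 312.0   # Tensor cores, FP16 with FP32 accumulate

combined = TENSOR_FP16_TFLOPS + FP32_TFLOPS
gain_pct = 100.0 * FP32_TFLOPS / TENSOR_FP16_TFLOPS

print(f"combined peak: {combined:.1f} TFLOPS")  # 331.5 TFLOPS
print(f"relative gain: {gain_pct:.2f}%")        # 6.25%
```

So yes, the ceiling moves up only by a small portion (~6%), and in practice the FP32 pipe is usually busy with the address arithmetic and post-processing mentioned above rather than with extra independent FLOPs.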