Maximizing FLOPS

Hi everybody,

I am working on maximizing the performance of a kernel which does element-wise multiplication of a large matrix. From this link it seems that most modern GPU architectures (including the one I'm using) are capable of performing 2 floating-point operations every cycle (for 32-bit precision). Does NVCC automatically generate assembly code that takes advantage of this, or does it need to be programmed into the CUDA code? For example, in a kernel which does two separate multiplications one after the other, will the generated PTX code do both multiplications simultaneously, or does that have to be specified in the CUDA code?


NVIDIA GPUs support FMA (fused multiply-add), which performs a floating-point multiply and an add in a single instruction. The compiler will try to generate this instruction. The CUDA binary utilities and the CUDA profilers can output the SASS assembly code. Moreover, the CUDA profilers can collect and show you the FLOP/s count and the count of instructions executed.

CC 3.0 and above can only issue 1 math instruction per cycle per SM warp scheduler (4 schedulers per SM).

SASS is the final GPU µcode. I would highly recommend against drawing conclusions from PTX, which is an intermediate language. The CUDA profilers with SASS>PTX>Source correlation give you the best way to understand the code generation and performance bottlenecks in your code.


Thanks for the response Greg,
Does this mean that the number of FLOP per cycle is actually limited by the number of SMs, rather than the total number of CUDA cores? For example, my GPU has 64 CUDA cores per SM, so do I only get 4 FLOP per cycle in total from 64 CUDA cores?

Volta and Turing SMs each have 64 CUDA cores. The actual implementation is that each SM has 4 sub-partitions (one per warp scheduler). Each sub-partition has 1 FMA pipe that is 16 threads wide. A warp instruction is issued to the pipe over 2 cycles. This means that an SM can execute 64 FP32 thread instructions (not warp instructions) per cycle.

If the instructions executed are FADD or FMUL then OPS/thread is 1 so FLOPs/cycle/SM is 64.
If the instructions executed are FMA then OPS/thread is 2 so FLOPs/cycle/SM is 128.
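
The per-SM arithmetic above can be written out as a short sketch. The constant and function names here are made up for illustration, and the numbers are simply the Volta/Turing figures quoted in this post, not values queried from a device.

```cpp
// Per-SM FP32 throughput for a Volta/Turing-style SM.
// All figures are taken from the post above; the names are illustrative only.
constexpr int kSubPartitionsPerSM = 4;        // one warp scheduler each
constexpr int kFmaLanesPerSubPartition = 16;  // FMA pipe width, in threads
constexpr int kWarpSize = 32;                 // threads per warp

// Cycles needed to push one warp instruction through a 16-wide pipe.
constexpr int cycles_to_issue_warp() {
    return kWarpSize / kFmaLanesPerSubPartition;
}

// FP32 thread instructions (not warp instructions) per cycle per SM.
constexpr int thread_instructions_per_cycle_per_sm() {
    return kSubPartitionsPerSM * kFmaLanesPerSubPartition;
}

// FADD/FMUL count as 1 operation per thread, FMA counts as 2.
constexpr int flops_per_cycle_per_sm(bool is_fma) {
    return thread_instructions_per_cycle_per_sm() * (is_fma ? 2 : 1);
}
```

With these figures, `flops_per_cycle_per_sm(false)` evaluates to 64 and `flops_per_cycle_per_sm(true)` to 128, matching the two cases above.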

Maximum FLOPS = SMcount x CUDACores/SM x 2 FLOPs/CUDACore x GPU Frequency

The maximum FLOPS requires all FP32 instructions to be FMA. If only FADD or FMUL are executed then the code can at best achieve 50% of the maximum FLOPS.
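
As a rough sketch of the formula above, the peak can be computed like this. The function name and the example GPU (40 SMs, 64 CUDA cores per SM, 1.5 GHz) are hypothetical and only there to show the arithmetic; substitute the numbers for your own device.

```cpp
// Peak FP32 FLOPS = SM count x CUDA cores per SM x FLOPs per core per cycle x clock.
// peak_fp32_flops is an illustrative helper, not part of any CUDA API.
constexpr double peak_fp32_flops(int sm_count, int cores_per_sm,
                                 double clock_hz, bool all_fma) {
    // FMA counts as 2 floating-point operations, FADD/FMUL as 1.
    return static_cast<double>(sm_count * cores_per_sm * (all_fma ? 2 : 1)) * clock_hz;
}

// Hypothetical GPU: 40 SMs, 64 CUDA cores per SM, 1.5 GHz.
constexpr double kPeakAllFma   = peak_fp32_flops(40, 64, 1.5e9, true);   // 7.68e12
constexpr double kPeakNoFma    = peak_fp32_flops(40, 64, 1.5e9, false);  // 3.84e12
```

For this made-up configuration the FMA-only peak is 7.68 TFLOPS, and a kernel issuing only FMUL (such as a plain element-wise multiply) tops out at half of that, 3.84 TFLOPS.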
