Maximizing FLOPS

john.mcinnis · July 17, 2020, 11:37am

Hi everybody,

I am working on maximizing the performance of a kernel which does element-wise multiplication of a large matrix. According to the Wikipedia article on FLOPS, most modern GPU architectures (including the one I’m using) are capable of performing 2 floating point operations every cycle (for 32-bit precision). To take advantage of this, does NVCC automatically generate assembly code which uses this functionality, or is it something that needs to be programmed into the CUDA code? For example, in a kernel which does two separate multiplications one after the other, will the generated PTX code do both multiplications simultaneously? Or is that something that needs to be specified in the CUDA code?

Thanks,
John

Greg · July 17, 2020, 9:42pm

NVIDIA GPUs support FMA (fused multiply-add), a floating point multiply and accumulate in a single instruction. The compiler will try to generate this instruction. The CUDA binary utilities and the CUDA profilers support output of the SASS assembly code. Moreover, the CUDA profilers can collect and show you the FLOP/s count and the executed instruction count.

CC 3.0 and above can only issue 1 math instruction per cycle per SM warp scheduler (4 schedulers per SM).

SASS is the final GPU µcode. I would strongly recommend against drawing conclusions from PTX, which is an intermediate language. The CUDA profilers with SASS>PTX>Source correlation will provide you the best method to understand the code generation and performance bottlenecks in your code.

john.mcinnis · July 20, 2020, 12:57pm

Thanks for the response Greg,
Does this mean that the number of FLOP per cycle is actually limited by the number of SMs, rather than the total number of CUDA cores? For example, my GPU has 64 CUDA cores per SM, so do I only get 4 FLOP per cycle in total from 64 CUDA cores?

Greg · July 22, 2020, 5:09pm

Volta and Turing SMs have 64 CUDA cores each. The actual implementation is that each SM has 4 sub-partitions (one warp scheduler each). Each sub-partition has 1 FMA pipe that is 16 threads wide, so a 32-thread warp instruction is issued to the pipe over 2 cycles. This means that an SM can execute 64 FP32 thread instructions (not warp instructions) per cycle.

If the instructions executed are FADD or FMUL then OPS/thread is 1 so FLOPs/cycle/SM is 64.
If the instructions executed are FMA then OPS/thread is 2 so FLOPs/cycle/SM is 128.
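
As a rough check of the arithmetic above, here is a minimal host-side C++ sketch of the per-SM numbers. The struct and function names are made up for illustration (they are not part of any NVIDIA API), and the 4 sub-partitions, 16-lane FMA pipe and 32-thread warp are simply the Volta/Turing figures quoted in this reply.

```cpp
// Illustrative model only: names are hypothetical, values come from the reply above.
struct SmConfig {
    int sub_partitions;  // warp schedulers per SM (4 on Volta/Turing)
    int fma_pipe_lanes;  // threads each FMA pipe processes per cycle (16)
    int warp_size;       // threads per warp (32)
};

// Cycles needed to push one warp instruction through one FMA pipe.
constexpr int cycles_to_issue_warp(const SmConfig& sm) {
    return sm.warp_size / sm.fma_pipe_lanes;
}

// FP32 thread instructions (not warp instructions) an SM completes per cycle.
constexpr int thread_instructions_per_cycle(const SmConfig& sm) {
    return sm.sub_partitions * sm.fma_pipe_lanes;
}

// flops_per_instruction is 1 for FADD/FMUL and 2 for FMA.
constexpr int flops_per_cycle_per_sm(const SmConfig& sm, int flops_per_instruction) {
    return thread_instructions_per_cycle(sm) * flops_per_instruction;
}
```

With `SmConfig{4, 16, 32}` this gives 2 cycles per warp instruction, 64 thread instructions per cycle, 64 FLOPs/cycle/SM for FADD or FMUL, and 128 FLOPs/cycle/SM for FMA.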

Maximum FLOPS = SMcount x CUDACores/SM x 2 FLOPs/CUDACore x GPU Frequency

The maximum FLOPS requires all FP32 instructions to be FMA. If only FADD or FMUL are executed then the code can at best achieve 50% of the maximum FLOPS.
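
To make the formula concrete, here is a small host-side C++ sketch. The function names are hypothetical, and the example device in the note below (40 SMs at 1.5 GHz) is an assumed configuration chosen for round numbers, not a specific GPU; substitute the SM count and clock of your own card.

```cpp
// Illustrative helpers only: not part of the CUDA toolkit.

// Maximum FLOPS = SM count x CUDA cores per SM x 2 FLOPs per core x frequency.
// The factor 2 assumes every FP32 instruction is an FMA.
constexpr double peak_fp32_flops(int sm_count, int cores_per_sm, double clock_hz) {
    return static_cast<double>(sm_count) * cores_per_sm * 2.0 * clock_hz;
}

// Best-case fraction of peak for an instruction mix that performs
// flops_per_instruction FLOPs per thread instruction
// (1 for FADD/FMUL only, 2 for FMA only).
constexpr double best_case_fraction_of_peak(int flops_per_instruction) {
    return flops_per_instruction / 2.0;
}
```

For the assumed 40-SM, 64-cores-per-SM device at 1.5 GHz, `peak_fp32_flops(40, 64, 1.5e9)` evaluates to 7.68e12, i.e. 7.68 TFLOPS. A kernel that only multiplies element-wise issues FMUL rather than FMA, so `best_case_fraction_of_peak(1)` is 0.5 and the ceiling for that kernel would be 3.84 TFLOPS, before memory bandwidth is even considered.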
