I understand that processor manufacturers, including Nvidia, like to quote theoretical FLOPS values for their products, derived from the specifications of the GPU. What I am wondering is how close to that theoretical FLOPS value is actually achievable in practice. For example, using a program that performs very repetitive FMA operations across all available cores, how close to the maximum FLOPS count would be realistic? If the generated assembly ended up being just a load, an FMA, and a store instruction for each thread, would it be around 1/3 of the listed FLOPS?
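Roughly the kind of kernel I have in mind (just a sketch, the names are made up):

```
// Sketch of the kind of kernel I mean: each thread does essentially one
// load, one FMA, and one store, so most of the work is memory traffic.
__global__ void one_fma(const float* __restrict__ x, float* __restrict__ y,
                        float a, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = fmaf(a, x[i], b);   // 1 FMA = 2 FLOP per 8 bytes moved
}
```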
If the floating-point units are the bottleneck (i.e., high computational intensity), a reasonable first-order estimate for well-optimized compiled code would be about 75% of theoretical peak. An example would be a BLAS3 GEMM-style matrix multiply.
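As a rough illustration (this is not a GEMM, just a toy sketch of raising the FLOP-per-byte ratio; the kernel name and parameters are made up), something like the following amortizes each load and store over many register-resident FMAs, which is what pushes a kernel toward the compute-bound regime:

```
// Hypothetical compute-bound sketch: many register-resident FMAs per element,
// so the floating-point units rather than memory become the bottleneck.
__global__ void fma_heavy(const float* __restrict__ x, float* __restrict__ y,
                          float a, float b, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < iters; ++k)
            v = fmaf(a, v, b);     // dependent chain; a tuned microbenchmark
                                   // would use several independent chains
        y[i] = v;                  // ~2*iters FLOP per 8 bytes moved
    }
}
```

A real GEMM gets its high computational intensity from data reuse (tiles staged in shared memory and registers) rather than from an artificial loop like this.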
However, in your chosen example memory throughput is the bottleneck (i.e., very low computational intensity). An example would be BLAS1 AXPY-style vector processing. In those circumstances the achieved FLOPS may be just 1/20 of the theoretical peak, or even less, depending on peak memory bandwidth, memory access patterns, and data type.
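A back-of-the-envelope roofline estimate shows why. For the load/FMA/store kernel in your question there are 2 FLOP per 8 bytes of DRAM traffic, so the attainable rate is bounded by bandwidth times FLOP-per-byte. The peak numbers below are placeholders, not the specs of any particular GPU:

```
#include <cstdio>

int main()
{
    const double peak_flops = 30e12;  // placeholder FP32 peak, FLOP/s
    const double peak_bw    = 900e9;  // placeholder DRAM bandwidth, bytes/s

    // y[i] = a*x[i] + b: 2 FLOP per element, 8 bytes moved (4 read + 4 write)
    const double intensity  = 2.0 / 8.0;            // FLOP per byte
    const double attainable = peak_bw * intensity;  // bandwidth-limited FLOP/s

    std::printf("attainable: %.0f GFLOP/s = %.1f%% of peak\n",
                attainable / 1e9, 100.0 * attainable / peak_flops);
    return 0;
}
```

With those placeholder numbers the memory-bound ceiling is around 225 GFLOP/s, well under 1% of the assumed 30 TFLOPS peak, which is why "around 1/3 of the listed FLOPS" is far too optimistic for that kind of kernel.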
These general “efficiency” effects are quite similar between modern CPUs and GPUs. It might be instructive to look at published data for HPL (High Performance Linpack) for the first scenario and HPCG (High Performance Conjugate Gradients) for the second scenario.