When I run the sample from Nvidia on RTX 4070 (30 TFLOP/s peak theoretical), it shows this performance:

Compute 8.9 CUDA device: [NVIDIA GeForce RTX 4070]
number of bodies = 2048000
2048000 bodies, total time for 10 iterations: 56535.691 ms
= 741.886 billion interactions per second
= 14837.721 single-precision GFLOP/s at 20 flops per interaction

This is nearly 50% of peak while reaching 5% higher boost frequency than the default (~2600 MHz). I know that this program is not meant to measure performance, but what are the main reasons of this performance? I guess one is having not equal multiplies & additions because marketed peak values are always about FMA operations that are 1 add & 1 mul. But nbody doesn’t have equal number of them in calculations and there is an extra rsqrt function that may not be using equal number of additions & multiplications under the hood (do special function units even use other fp/int pipelines for its internal work?).

RTX4000 series GPUs have good shared-memory bandwidth. This should not be the issue as this program uses shared-memory-tiling to compute particle-particle interactions.

Throughput of rsqrt was 1/4 of peak in old cards, is it same for Ada series too? Should any kind of Quake-fast-inverse-square-root implementation help in this scenario (especially for the integer capabilities of Ada) to balance the peak-flops / current-flops value?

Lastly, can anyone else with an Ada GPU try this for a comparison of GFLOP/s with the RTX4070 I’m testing?

See section 5.4.1 Arithmetic Instructions of the CUDA Programming Guide. For compute capability 8.6, 8.9, and 9.0, the throughput of the MUFU instructions is 16 versus 128 for single-precision arithmetic operations, so throughput is 1/8 instead of 1/4. Even so, it is not possible to replace MUFU.RSQ with a sequence of integer and FP32 arithmetic instructions with better performance at roughly identical accuracy. Example:

/* compute 1/sqrt(x) on [2**(-126), 2**128) */
__device__ float my_rsqrtf (float a)
{
float r;
#if USE_NATIVE
// maximum error = 1.51449 ulps
asm ("rsqrt.approx.ftz.f32 %0,%1; \n\t" : "=f"(r) : "f"(a));
#else // USE_NATIVE
// maximum error = 1.16684 ulps
r = __int_as_float (0x5f37642f - (__float_as_int(a) >> 1));
r = fmaf (0.5f * r, fmaf (a * r, -r, 1.0f), r);
float e = fmaf (a * r, -r, 1.0f);
r = fmaf (fmaf (0.375f, e, 0.5f), e * r, r);
#endif // USE_NATIVE
return r;
}

If you want to explore the performance bottlenecks in the code, I would suggest using the CUDA profiler.

What about not replacing but load-balancing between MUFU & INT-FP32? Can we somehow get more than 100% of MUFU by using both MUFU and INT/FP32 with a modulo pattern (like using MUFU for threads with id 1,3,5,7,… and FP32 for threads with id 0,2,4,etc) or block-based modulo (first block uses MUFU, second block uses FP32, alternating)

Maybe, if dx,dy,dz values are normalized, can we have same accuracy with polynomial-approximation (pure FMA FP32) and use this to help the MUFU?

Can INT pipeline be used by scaling the FP values and rescaling back? FMA works on INT pipeline too? (to have INT + FP32 + MUFU together for higher throughput than just 1/8 of peak)