About CUDA nbody sample performance comparison

tugrul_192bit · July 29, 2023, 6:30pm

When I run the nbody sample from NVIDIA on an RTX 4070 (about 30 TFLOP/s theoretical peak), it reports this performance:

Compute 8.9 CUDA device: [NVIDIA GeForce RTX 4070]
number of bodies = 2048000
2048000 bodies, total time for 10 iterations: 56535.691 ms
= 741.886 billion interactions per second
= 14837.721 single-precision GFLOP/s at 20 flops per interaction

This is nearly 50% of peak, while boosting to a clock about 5% above the default (~2600 MHz). I know this program is not meant to measure performance, but what are the main reasons it stops at about half of peak? My guess is that one reason is the unequal mix of multiplies and additions: marketed peak values assume every instruction is an FMA, which counts as 1 add and 1 mul. The nbody calculation does not have equal numbers of them, and there is an extra rsqrt that may not use equal numbers of additions and multiplications under the hood (do the special function units even use the other FP/INT pipelines for their internal work?).

RTX 4000-series GPUs have good shared-memory bandwidth, so that should not be the issue, since this program uses shared-memory tiling to compute particle-particle interactions.

The throughput of rsqrt was 1/4 of peak on older cards; is it the same for the Ada series? Would some kind of Quake-style fast inverse square root help here (especially given the integer capabilities of Ada) to close the gap between peak and achieved FLOP/s?

Lastly, can anyone else with an Ada GPU try this for a comparison of GFLOP/s with the RTX4070 I’m testing?

njuffa · July 29, 2023, 7:41pm

See section 5.4.1, Arithmetic Instructions, of the CUDA Programming Guide. For compute capability 8.6, 8.9, and 9.0, the throughput of the MUFU instructions is 16 results per clock cycle per SM, versus 128 for single-precision arithmetic operations, so the relative throughput is 1/8 rather than 1/4. Even so, it is not possible to replace MUFU.RSQ with a sequence of integer and FP32 arithmetic instructions that performs better at roughly identical accuracy. Example:

/* compute 1/sqrt(x) on [2**(-126), 2**128) */
__device__ float my_rsqrtf (float a)
{
    float r;
#if USE_NATIVE
    // maximum error = 1.51449 ulps
    asm ("rsqrt.approx.ftz.f32 %0,%1; \n\t" : "=f"(r) : "f"(a));
#else // USE_NATIVE
    // maximum error = 1.16684 ulps
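    // initial guess from the bit pattern of a (magic-constant trick)
    r = __int_as_float (0x5f37642f - (__float_as_int(a) >> 1));
    // Newton-Raphson step: r += 0.5 * r * (1 - a * r * r)
    r = fmaf (0.5f * r, fmaf (a * r, -r, 1.0f), r);
    // residual e = 1 - a * r * r, then r += r * e * (0.5 + 0.375 * e)
    float e = fmaf (a * r, -r, 1.0f);
    r = fmaf (fmaf (0.375f, e, 0.5f), e * r, r);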
#endif // USE_NATIVE
    return r;
}
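For readers without a GPU at hand, the non-native branch can be checked on the CPU. The following is a host-side sketch, not part of the original post: it repeats the same arithmetic with `std::memcpy` standing in for `__float_as_int` / `__int_as_float`, and the helper names (`rsqrt_soft`, `rsqrt_rel_error`, `max_rel_error`) are made up for illustration. It measures relative error against a double-precision reference rather than ulps, and it only covers positive normal inputs.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstring>

// Host-side bit reinterpretation, standing in for the CUDA intrinsics.
static int32_t float_as_int(float f) { int32_t i; std::memcpy(&i, &f, sizeof i); return i; }
static float int_as_float(int32_t i) { float f; std::memcpy(&f, &i, sizeof f); return f; }

// Same operation sequence as the non-native branch of my_rsqrtf.
float rsqrt_soft(float a) {
    float r = int_as_float(0x5f37642f - (float_as_int(a) >> 1));  // initial guess
    r = std::fma(0.5f * r, std::fma(a * r, -r, 1.0f), r);         // Newton-Raphson step
    float e = std::fma(a * r, -r, 1.0f);                          // residual 1 - a*r*r
    r = std::fma(std::fma(0.375f, e, 0.5f), e * r, r);            // higher-order correction
    return r;
}

// Relative error of rsqrt_soft against a double-precision reference.
double rsqrt_rel_error(float a) {
    double ref = 1.0 / std::sqrt(static_cast<double>(a));
    return std::fabs(static_cast<double>(rsqrt_soft(a)) - ref) / ref;
}

// Largest relative error over n logarithmically spaced samples in [lo, hi].
double max_rel_error(float lo, float hi, int n) {
    double worst = 0.0;
    double ratio = static_cast<double>(hi) / lo;
    for (int i = 0; i < n; ++i) {
        double t = (n > 1) ? static_cast<double>(i) / (n - 1) : 0.0;
        float x = static_cast<float>(lo * std::pow(ratio, t));
        worst = std::max(worst, rsqrt_rel_error(x));
    }
    return worst;
}
```

A call such as `max_rel_error(1e-30f, 1e30f, 100000)` should stay within a few times 1e-7, which is what an error of about one ulp in single precision looks like. It says nothing about speed; that still has to be measured on the GPU.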

If you want to explore the performance bottlenecks in the code, I would suggest using the CUDA profiler.

tugrul_192bit · July 29, 2023, 7:57pm

What about load-balancing between MUFU and INT/FP32 instead of replacing one with the other? Can we somehow get more than 100% of MUFU throughput by using both MUFU and INT/FP32 in a modulo pattern (for example MUFU for threads with id 1, 3, 5, 7, … and FP32 for threads with id 0, 2, 4, …), or a block-based modulo (first block uses MUFU, second block uses FP32, alternating)?

Alternatively, if the dx, dy, dz values are normalized, can a polynomial approximation (pure FP32 FMA) reach the same accuracy and be used to help the MUFU?

Can the INT pipeline be used by scaling the FP values and rescaling them back? Does FMA work on the INT pipeline too? (The idea is to run INT + FP32 + MUFU together for higher throughput than just 1/8 of peak.)

njuffa · July 29, 2023, 8:05pm

Profile. Experiment. The answers to your questions will reveal themselves.
