@mjoux @Robert_Crovella and any others,
The Blackwell Integer thread:
https://forums.developer.nvidia.com/t/blackwell-integer/320578/164
says new users can only reply 3 times, so I have to start this new thread for you.
I posted this question in the Mersenne Forum thread mentioned there and wanted to ask you too.
For the PRPLL NTT, when using the M31*M61 integer NTT, here are some timings (microseconds per iteration) reported by users for a 140 million exponent:
- RTX 2070: 1523.8
- RTX 2080 Ti: 1119.6
- RTX 4060: 1693.6
- RTX 4070 Super: 923.2
- RTX 4090: 424.6
- RTX 5090: 235.0
- Nvidia A100 40GB (data center GPU on Google Colab): 571.0
You can see the theoretical FP32 Teraflops for each on TechPowerUp or simply do the math on a calculator yourself so no need to state those here.
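To make the comparison concrete, here is a quick Python sketch that converts the µs/iteration figures above into iterations/sec per theoretical FP32 teraflop. The TFLOPS values are my assumptions taken from TechPowerUp listings, so please double-check them:

```python
# µs/iteration figures quoted above.
timings_us = {
    "RTX 2070": 1523.8,
    "RTX 2080 Ti": 1119.6,
    "RTX 4060": 1693.6,
    "RTX 4070 Super": 923.2,
    "RTX 4090": 424.6,
    "RTX 5090": 235.0,
    "A100 40GB": 571.0,
}
# Theoretical FP32 TFLOPS -- assumed from TechPowerUp, verify yourself.
fp32_tflops = {
    "RTX 2070": 7.465,
    "RTX 2080 Ti": 13.45,
    "RTX 4060": 15.11,
    "RTX 4070 Super": 35.48,
    "RTX 4090": 82.58,
    "RTX 5090": 104.8,
    "A100 40GB": 19.49,
}
for gpu, us in timings_us.items():
    iters_per_sec = 1e6 / us                     # iterations per second
    eff = iters_per_sec / fp32_tflops[gpu]       # it/s per theoretical TFLOP
    print(f"{gpu:15s} {iters_per_sec:7.1f} it/s  {eff:6.1f} it/s per TFLOP")
```

On these numbers the Turing and A100 parts come out well ahead of Ada per theoretical FP32 teraflop, which is the disproportion the questions below are about.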
Here are the two questions for you:
(note you can Google Search Turing, Ampere, Ada Lovelace, and Blackwell architecture whitepapers on nvidia.com to see details)
-
As the Nvidia whitepapers say, the A100 and RTX 20xx, 30xx, and 40xx allow simultaneous execution of FP32 and INT32 operations at full throughput, so FP32 and INT32 instructions can be mixed. Is the reason the RTX 20xx and A100 show results better proportioned to their theoretical FP32 teraflops that all of their CUDA cores support both FP32 and INT32, so the PTX compiler can interleave the two and optimize the PRPLL NTT for maximum speed, whereas on RTX 30xx and 40xx, although you can still mix the two, all CUDA cores support FP32 but only half support INT32, so you don't get that extra optimization? Or is there an optimization issue with the PTX compiler?
-
As Nvidia employee mjoux stated in the Blackwell Integer thread, and as the text quoted below from the updated whitepaper on nvidia.com shows, on RTX 50xx all CUDA cores support FP32 or INT32, but only some INT32 instructions can run at up to 2x throughput over Ada Lovelace (which is hard to achieve, according to the Nvidia employee), and in addition the unified cores must all operate as either FP32 or INT32 in a given clock cycle rather than mixing the two. Does this limit optimization and cause the RTX 50xx to behave like Ada Lovelace for the PRPLL NTT, in proportion to theoretical FP32 teraflops?
The reason I am asking is that, if my understanding is correct, you get cases like this where the older RTX 2070 runs at a slower clock speed, has fewer CUDA cores, and has half the theoretical FP32 teraflops of the newer RTX 4060, yet is slightly faster on this integer NTT due to the architecture design. It would also explain why the data center A100 is about 3 times faster than the RTX 4060 despite having only about 1.3x its FP32 teraflops.
Luckily the A100 has strong FP64 (double-precision), so it can use the FP64 FFT, which is less than 25% faster than the NTT. The design changes Nvidia made over the years before Pascal that resulted in weak FP64 on consumer GPUs are what necessitate the new integer NTT for PRPLL, and the 2070 beating the 4060 looks like a case of déjà vu. Of course there were reasons for this, with pluses and minuses to the architecture changes. GIMPS was at the forefront of using SSE2, FMA3/AVX2, AVX-512, etc. on the CPU side, thanks to the expertise of George Woltman and others, which sped up our project greatly. Even so, I remember reading that when Intel raised clock speeds for the Pentium 4 and introduced SSE2, the architecture design caused legacy applications to suffer; later CPU architectures thankfully addressed this. So you can see both CPU and GPU architecture changes can have significant impact.
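To show why the M31*M61 NTT is so integer-heavy, here is a small Python sketch (my own illustration, not PRPLL's actual kernel code) of reduction modulo the Mersenne prime M31 = 2^31 - 1, which needs only shifts, masks, and adds:

```python
# Illustrative sketch, not PRPLL's actual code: because 2**31 ≡ 1 (mod M31),
# a product can be reduced mod M31 with shifts and masks alone -- pure
# integer work, which is why this NTT leans so heavily on INT32 throughput.
M31 = (1 << 31) - 1

def reduce_m31(x: int) -> int:
    """Reduce x (any value below 2**62) modulo M31 = 2**31 - 1."""
    x = (x & M31) + (x >> 31)   # fold high bits onto low bits
    x = (x & M31) + (x >> 31)   # second fold absorbs the carry
    return 0 if x == M31 else x

def mulmod_m31(a: int, b: int) -> int:
    """Multiply two residues mod M31 without a hardware divide."""
    return reduce_m31(a * b)
```

The same trick works for M61 = 2^61 - 1 with 64-bit limbs; on a GPU these are exactly the kinds of INT32/INT64 instruction streams whose throughput the questions above are about.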
INT operation update added in v1.1 of this whitepaper:

> Note that the number of possible integer operations in Blackwell GB20x GPUs are doubled for many integer instructions compared to Ada, by fully unifying the INT32 cores with the FP32 cores, as depicted in Figure 6 below. However, the unified cores can only operate as either FP32 or INT32 cores in any given clock cycle. While many common INT operations can run at up to 2x throughput, not all INT operations can attain 2x speedups. For more details, please refer to the NVIDIA CUDA Programming Guide.