Integer NTT on RTX 20xx, A100 vs RTX 30xx, 40xx, 50xx

@mjoux @Robert_Crovella and any others,

The Blackwell Integer thread:

https://forums.developer.nvidia.com/t/blackwell-integer/320578/164

states that new users can only reply 3 times, so I have to start this new thread for you.

In the Mersenne Forum thread mentioned there I posted this question and wanted to ask you too.

For PRPLL NTT, when using M31*M61 integer NTT, here are some timings (microseconds per iteration) from users for 140 million exponent:

RTX 2070: 1523.8
RTX 2080 Ti: 1119.6
RTX 4060: 1693.6
RTX 4070 Super: 923.2
RTX 4090: 424.6
RTX 5090: 235
Nvidia A100 40GB (data center GPU on Google Colab): 571.0

You can see the theoretical FP32 Teraflops for each on TechPowerUp or simply do the math on a calculator yourself so no need to state those here.

Here are the two questions for you:

(note you can Google Search Turing, Ampere, Ada Lovelace, and Blackwell architecture whitepapers on nvidia.com to see details)

  1. As the Nvidia whitepapers say, the A100 and RTX 20xx, 30xx, and 40xx allow simultaneous execution of FP32 and INT32 operations at full throughput, so you can mix and match FP32 and INT32. Is the reason RTX 20xx and the A100 show results better proportioned to their theoretical FP32 teraflops that all their CUDA cores support both FP32 and INT32, so the PTX compiler can mix and match and optimize the PRPLL NTT for maximum speed, whereas on RTX 30xx and 40xx, while you can also mix and match, all CUDA cores support FP32 but only half support INT32, so you don’t get that extra possible optimization? Or is there an optimization issue with the PTX compiler?

  2. As Nvidia employee mjoux stated in the Blackwell Integer thread, and as shown in the text below from the updated whitepaper on nvidia.com, on RTX 50xx all CUDA cores support FP32 or INT32, but only a few of the INT32 instructions can run at up to 2x the throughput of Ada Lovelace (which is hard to achieve, according to the Nvidia employee), and in addition, in any given cycle the unified cores have to run as either FP32 or INT32 (instead of mixing and matching). Does this limit optimization and cause RTX 50xx to behave like Ada Lovelace for the PRPLL NTT in proportion to theoretical FP32 teraflops?

The reason I am asking is that if my understanding is correct, you get cases like this where the older RTX 2070 runs at a slower clock speed, has fewer CUDA cores, and has half the theoretical FP32 teraflops of the newer RTX 4060, yet is slightly faster at this integer NTT due to the architecture design. It would also explain why the data center A100 is 3 times faster than the RTX 4060 despite having only 1.3x its FP32 teraflops, if my interpretation is correct.

Luckily the A100 has strong FP64 (double precision), so it can use an FP64 FFT, which is less than 25% faster than the NTT. The design changes Nvidia made over the years before Pascal that resulted in weak FP64 on consumer GPUs are what necessitate the new integer NTT for PRPLL, and the 2070 being faster than the 4060 looks like a case of deja vu. Of course there are reasons why this was done; architecture changes have pluses and minuses. GIMPS was at the forefront of using SSE2, FMA3/AVX2, AVX-512, etc. on the CPU side, thanks to the expertise of George Woltman and others, which sped up our project greatly. But I also remember reading that when Intel raised Pentium 4 clock speeds and introduced SSE2, the architecture changes made legacy applications suffer. Later CPU architectures thankfully addressed this, so you can see that both CPU and GPU architecture changes can have significant impact.

https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf

INT operation update added in v1.1 of this whitepaper >> Note that the number of possible integer operations in Blackwell GB20x GPUs are doubled for many integer instructions compared to Ada, by fully unifying the INT32 cores with the FP32 cores, as depicted in Figure 6 below. However, the unified cores can only operate as either FP32 or INT32 cores in any given clock cycle. While many common INT operations can run at up to 2x throughput, not all INT operations can attain 2x speedups. For more details, please refer to the NVIDIA CUDA Programming Guide.

I have no knowledge of the workload you’re discussing, but one factor not touched on, between the three cards mentioned in the above quote, is memory bandwidth.

The 2070 is 448GB/s, 4060 is 272GB/s and A100, 1.56TB/s. I suspect this is an important consideration.

[The instruction numbers are per 2 cycles, as each SM has its cores in multiples of 16, with 32 threads per warp.]

20x0 can mix one FP32 with one INT32.

30x0 (consumer) and 40x0 (+A100) should be about the same speed, they both can execute either two FP32 or mix one FP32 with one INT32.

I believe if you compare an SM pre-50x0 to 50x0, it got slightly faster, but far from doubled in integer performance when considering the reachable speed after optimization. So as long as you don’t compare to theoretical numbers, the architecture still got slightly better. It can execute two INT32 instructions (valid for most instructions).

So there is a steady improvement in SM performance (whether it is worth the power or IC real estate is a different tradeoff).

If you have other observations, they probably have a different reason.

You can always run with Nsight Compute to see the ratio of maximum compute and memory throughput reached.

Looks like you are right about memory bandwidth. Yves Gallot gave the response below on the Mersenne Forum; it is crazy that the 2070 is faster than the 4060…

  1. 30x0/40x0 are twice as fast as 20x0 for FP32 because each SM can execute 128 FP32 (vs 64 for 20x0).

    But the code of PRPLL NTT is about 50% IMAD and 50% INT32. IMAD is executed on a FP32 unit. The key point is the concurrent execution of FP32 and INT32 instructions in the Turing SM.
    Each SM is able to execute 4 SIMD/cycle (SIMD is 32-lane wide). Because IMAD and INT32 run concurrently, each 20x0 SM executes 64 IMAD + 64 INT32. 30x0/40x0 also execute 64 IMAD + 64 INT32, but not because of concurrency but because the number of cores per SM is 128.

    The RTX 2070 has a faster iteration time than the RTX 4060 because of the data bandwidth (448.0 GB/s vs 272.0 GB/s). 140 million exponent is a huge number, larger than the 4060 L2 cache.

  2. The throughput of IADD is 2x, but the throughput of IADD3 is 1x. IADD3 is equivalent to two IADDs, so there is no improvement for 32-bit instructions.
    The new 64-bit IADD instruction is an improvement for the M61 code. But the code is not just 64-bit additions.

    If RTX 5090 is faster than RTX 4090, it is certainly because of GDDR7 and the 512-bit data bus, not because of the architecture of cores.
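Yves’ per-SM reasoning can be sanity-checked with a quick back-of-envelope script. This is only a sketch: the boost clocks are taken from public spec sheets, and the 128 instructions per SM per cycle (64 IMAD + 64 INT32) is the figure from the analysis above, not an official NVIDIA number.

```python
# Back-of-envelope sketch (assumptions: public boost clocks; 64 IMAD +
# 64 INT32 = 128 instructions issued per SM per cycle for the 50/50 mix,
# per the analysis above -- not an official NVIDIA figure).
specs = {
    # name: (SM count, boost clock in GHz)
    "RTX 2070":  (36,  1.62),
    "RTX 4060":  (24,  2.46),
    "A100 40GB": (108, 1.41),
}

INSTR_PER_SM_CYCLE = 128  # 64 IMAD + 64 INT32

rates = {}
for name, (sms, ghz) in specs.items():
    rates[name] = sms * INSTR_PER_SM_CYCLE * ghz / 1000  # tera-instructions/s
    print(f"{name}: ~{rates[name]:.1f} Tinstr/s for the 50/50 mix")

print(f"A100 / 4060: {rates['A100 40GB'] / rates['RTX 4060']:.2f}")
```

The 2070 and 4060 land within about 2% of each other, and the A100 at roughly 2.6x the 4060, which tracks the measured iteration times far better than FP32 teraflops alone.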

@rs277 @Curefab and everyone else,

I just posted the below on Mersenne Forum - I think the main difference is still architecture rather than memory bandwidth on why RTX 2070 is slightly faster than RTX 4060 (and likewise A100 being 3 times faster than RTX 4060).

Let me know your thoughts - be advised Nvidia Forums may be restrictive in me replying as a new user so hopefully I can respond.

–

I appreciate all your help - see below - I think I can better explain why architecture difference may be the true influence.

I am aware of the memory bandwidth implications (a 140 million exponent is a 4M FFT size for the M31*M61 integer NTT in PRPLL, thus 8 * 4M = 32MB of memory needed, which is bigger than the RTX 4060’s 24MB L2 cache; the RTX 2070 has only 4MB of L2 cache), and your architecture breakdowns are indeed accurate, as shown in the whitepapers :)

You gave me better clarity on the RTX 20xx and A100 architecture: I should have doubled the theoretical number I was basing RTX 20xx and the A100 on. Now I can better pinpoint the architecture differences that explain the better performance. Besides memory bandwidth, it looks like my theory based on architecture differences was right, but I had the wrong interpretation. According to the breakdown below, the vast majority of the difference is architecture, not memory bandwidth.

Was my issue that I was looking solely at the theoretical FP32 FLOPS value when questioning why one GPU is faster than the other, when due to architecture differences I should instead focus on the combined FP32 OPS + INT32 OPS value for 50% IMAD / 50% INT32 code (in addition to any memory bandwidth impacts, such as the small penalty when 32MB spills out of the RTX 4060’s 24MB cache)?

Here is the breakdown:

Page 63 of the Turing whitepaper below states the RTX 2070 has:
36 Streaming Multiprocessors
64 CUDA Cores per Streaming Multiprocessor (page 11 states "each SM has a total of 64 FP32 Cores and 64 INT32 Cores" and "The Turing SM supports concurrent execution of FP32 and INT32 operations")
2304 CUDA Cores per GPU (36*64)
7.5 FP32 Teraflops
7.5 INT32 TIPS

https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

My issue was that I was focused on the 7.5 FP32 teraflops value for the RTX 2070, where for the PRPLL integer NTT I should instead focus on FP32 + INT32 = 15 tera operations per second for 50% IMAD and 50% INT32.

For the Ada Lovelace whitepaper below, page 29 has the RTX 4090 values and not the 4060’s, so I added the corresponding 4060 values in parentheses.

RTX 4090 (RTX 4060)

128 Streaming Multiprocessors (24 for RTX 4060)
128 CUDA Cores per Streaming Multiprocessor
16384 CUDA Cores per RTX 4090 GPU (3072 for RTX 4060)
82.6 FP32 TFLOPS (15.1 for RTX 4060)
41.3 INT32 TOPS (7.6 for RTX 4060)

https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf

Thus even though the 15.1 is FP32 teraflops, the true value I should use is half that number (7.6) for IMAD plus 7.6 for INT32, which equals the same 15.1 tera operations per second; conveniently the exact same value.

Now you see: both the RTX 2070 and the 4060 come out at the same value of about 15 (the math error in my previous comparison was comparing 7.5 to 15), thus they have about the same compute performance.
On the memory bandwidth side, for a 140 million exponent the RTX 2070 spills more out of its fast L2 cache into GDDR6 memory than the RTX 4060 does (a benefit for the 4060 there), but the 2070 has better GDDR6 memory bandwidth than the RTX 4060 once you do spill into memory.

So the 2070 being slightly faster than 4060 makes total sense with my entire breakdown above.

The same type of breakdown explains why the A100 in Google Colab is almost 3 times faster than the RTX 4060.

Below is the A100 Ampere whitepaper.

https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

Page 36 has the below for A100 40GB:

108 Streaming Multiprocessors
64 FP32 CUDA Cores per Streaming Multiprocessor
64 INT32 CUDA Cores per Streaming Multiprocessor
6912 FP32 CUDA Cores per GPU
6912 INT32 CUDA Cores per GPU
19.5 FP32 TFLOPS
19.5 INT32 TOPS

FP32 + INT32 = 19.5 +19.5 = 39 Tera Operations Per Second for 50% IMAD and 50% INT32.

Thus comparing A100 value of 39 to RTX 4060 value of 15 shows A100 is 2.6 times the power.
And on the memory bandwidth side, while A100 has superior 1.55 Terabytes per second memory bandwidth, it has 40MB L2 cache so 140 million exponent stays in cache, while 4060 slightly spills out.
Thus the A100 being 3 times faster than RTX 4060 makes perfect sense now.
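The combined-throughput arithmetic above can be written out in a few lines. A sketch under this post’s assumption that a 50% IMAD / 50% INT32 mix can use the FP32 and INT32 peaks additively (the 15.2 vs 15.1 discrepancy is just rounding of the whitepaper figures):

```python
# Usable tera-ops/s for 50% IMAD / 50% INT32 code (sketch; peak numbers
# taken from the whitepaper figures quoted in this post).
combined = {
    "RTX 2070":  7.5 + 7.5,    # concurrent FP32 + INT32 at full throughput
    "RTX 4060":  7.6 + 7.6,    # 128 cores/SM split: half on IMAD, half on INT32
    "A100 40GB": 19.5 + 19.5,  # concurrent FP32 + INT32 at full throughput
}
for name, tops in combined.items():
    print(f"{name}: ~{tops:.1f} TOPS")
print(f"A100 vs 4060: {combined['A100 40GB'] / combined['RTX 4060']:.2f}x")
```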

George provided the data size for 140M exponent is actually 48MB rather than 32MB.

From George Woltman:

A 4M NTT for M31*M61 needs 4M*(4 bytes + 8 bytes) = 48MB.

PRPLL uses two buffers to do an NTT (unlike prime95, it is not an in-place implementation). You’ll need a 48MB cache to hold the two buffers of NTT data.

In one squaring, there are 4 passes over the NTT data. Each pass reads from one buffer and writes to the other buffer. Each pass reads and writes 48MB of data. A squaring reads and writes 192MB of data. Twiddle factors also require some storage and bandwidth.
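George’s traffic numbers also give a quick bandwidth-only floor on iteration time. This is my own arithmetic, not from the thread: 192 MB of DRAM traffic per squaring divided by each card’s peak bandwidth, ignoring twiddle-factor traffic, L2 cache hits, and compute.

```python
# Lower-bound iteration time if a squaring were purely DRAM-bound
# (sketch: 192 MB traffic per squaring per George's post; peak bandwidths
# from spec sheets; ignores twiddles, L2 hits, and compute time).
TRAFFIC_BYTES = 192 * 2**20  # 4 passes, 48 MB read and written per pass

bandwidth_gbs = {
    "RTX 2070":  448.0,
    "RTX 4060":  272.0,
    "A100 40GB": 1555.0,
}

for name, gbs in bandwidth_gbs.items():
    t_us = TRAFFIC_BYTES / (gbs * 1e9) * 1e6
    print(f"{name}: >= {t_us:.0f} us per iteration")
```

The measured 4M times (1520.6, 1693.6, and 569.3 microseconds) all sit well above these floors, so the cards are not purely DRAM-bound at this size.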

For myself (wfgarnett3), I had wondered why Yves’ integer NTT genefer reports certain data sizes (see my post below); George’s 48MB explanation for the 4M NTT exactly matches the 48MB data size genefer shows for a 4M NTT. It also shows why the M31*M61 integer NTT, which chops the primality testing NTT size in half compared to an M61-only integer NTT, is so beneficial :)

https://www.primegrid.com/forum_thread.php?id=11303&nowrap=true#182665

Here is the perfect TechPowerUp internet thread where users talk about the same thing I mentioned.

https://www.techpowerup.com/forums/threads/ampere-v-s-turing.290508/

As one user there mentions, with a 50% FP32 / 50% INT32 code split you get the extreme case, as with the PRPLL NTT, where it is as if the RTX 2070 has 2*2304 = 4608 cores, which corresponds to the 15 tera operations per second I mentioned.

So with a 48MB data size rather than the 32MB I stated, there is more of a memory bandwidth impact, but that post shows the factor-of-2 architecture difference I am describing.

@mjoux @Robert_Crovella - As posters mention, blame Nvidia marketing for the difference in what a CUDA core means between architectures.

I think NVIDIA’s definition of a "CUDA core" is that entity (SM functional unit) which can do an FMUL, FADD, or FFMA operation (32-bit floating-point non-tensor ops). As far as that goes, it has been in effect as long as I can remember, at least since 2010.

I’m not aware of any situation or usage by NVIDIA where that definition has not been met.

Perhaps you are referring to "what else could possibly be scheduled on a CUDA core besides those particular SASS instructions?"

I don’t really know the answer to that. I don’t see that as affecting the definition of a core. According to the definition I have given, I believe NVIDIA’s usage is consistent.

As mentioned below, "core" by itself is used by NVIDIA in other contexts/meanings, such as "INT8 Tensor Core" or "INT32 Core". In my view that usage is distinct from the "CUDA core" terminology that has been used since 2010 or before.


Just to remark or add: Nvidia has been using the terms FP32 core and INT32 core, too, e.g. here:

Yes, I agree, and from a marketing perspective you can see a proliferation of the use of the word "core", such as here.

For my previous comment, I would say it applies to the use of "core" by itself, as a designator. It seems self-evident to me that the "core" I am referring to is, for example, not the one that is "INT8 Tensor Core".

I was responding to this sort of terminology:

There is a lot of architectural speculation in this thread, a fair bit of which doesn’t match the hardware. For a kernel that is 50% IMAD and 50% INT32, Volta through GB20x have the same instruction throughput. There are many other changes in the SM and memory system that likely contribute to the performance differences; the numbers look more closely correlated with the memory system than with INT32 throughput.

Instead of speculating I encourage you to run the program through Nsight Compute and use the baseline/diff feature to compare reports from multiple architectures.


Greg said the throughput of this INT32-IMAD instruction mix is the same throughout Volta-GB200.

But a different instruction mix may have changed in performance across the architectures. So comparing based on measured FLOP numbers from workloads different from yours will complicate rather than simplify things.

So rather than relating to or predicting from those, or from the advertised 'CUDA cores' (which are more closely aligned with FP32 capability), use the frequency (decide: base or boost) and the number of SMs in the GPU as the sole factors for instruction performance.

This still excludes the mentioned effect of the memory system.

I am going to add some more speculation here: The volunteers that report performance data back to the project probably have just one particular GPU at their disposal and are willing to run a set benchmark on it, but they may not even be CUDA developers familiar with operating the CUDA profiler.

Unavoidable side effect of NVIDIA’s chosen course of action, IMHO. Nature abhors a vacuum, and the easiest way to fill the void is with speculation.

You make good points, and we are all on the same page.

Yves Gallot posted the response that I put at the bottom here. (Yves created genefer at PrimeGrid, which recently found the 6th largest known prime number, and his integer NTT code was adapted for our use by George Woltman for PRPLL, which was created by Mihai Preda and George Woltman. The older FP64 gpuowl/PRPLL version found the largest known prime number in October 2024 on A100s run by former Nvidia employee Luke Durant; see www.mersenne.org .)

That was all I was saying: the 7.5 FP32 teraflops of the RTX 2070 vs the 15 FP32 teraflops of the RTX 4060 doesn’t mean the 4060 should have twice the performance, all other things being equal, for our testing; it’s a useless comparison.

We both agree they have almost equal integer-unit performance (as my architecture breakdown shows for 50% FP32 / 50% INT32 code), and the better memory bandwidth of the 2070 wins out for a 140 million exponent.

You can see he says exactly the same thing in response to my A100 genefer timings in comparison to RTX 5080 at this PrimeGrid post here:

https://www.primegrid.com/forum_thread.php?id=11303&nowrap=true#182666

I was initially shocked that the RTX 2070 is faster than the RTX 4060 for the PRPLL NTT when I saw the theoretical FP32 numbers (already taking into account the superior memory bandwidth of the RTX 2070). Once I did the architecture breakdown, it all makes sense.

So like I posted in my first post, architecture changes have big impacts.

Pentium 4, with its higher frequency and SSE2 (which George adapted for our project and which sped things up considerably, just like AVX2/FMA3, AVX-512, etc. later), was good for us, but Pentium 4 ran older code worse.

Consumer Nvidia GPUs had better FP64 performance in the past. As you can see from the FP64:FP32 ratios on TechPowerUp and in the Nvidia whitepapers, they now have much weaker FP64, with FP64:FP32 ratios of 1:64 (Blackwell/Ada Lovelace); even the older Turing was slightly better, though still bad, at 1:32. (AMD does the same now with its consumer GPUs.) That is why the GIMPS project recommended consumer Nvidia and AMD GPUs for trial factoring only, which uses the integer units; their FP64 performance is terrible.

Turing (20xx) brought true 32-bit integer multiplication over Pascal !! - so that is a step forward for Nvidia.

Compute Capability 6.1 (GeForce 10): SM = 128 INT32/FP32 + 4 FP64. Four 16-bit MAD instructions are needed to emulate a 32-bit MUL.
Compute Capability 7.5 (GeForce 20): SM = 64 FP32 + 64 INT32 + 2 FP64. True 32-bit MUL instruction (throughput 1 clock).

But then Nvidia does it again and later consumer architectures (Ampere 30xx, Ada Lovelace 40xx, Blackwell 50xx) got rid of the concurrent FP32 and INT32 possible on every "CUDA core", so a step back for our PRPLL NTT case of 50% INT32 and 50% FP32 code.

And then Nvidia does it again with Blackwell. It took all of us to get Nvidia to acknowledge in the Blackwell Integer thread that their marketing materials, whitepapers, etc. were wrong concerning 2x INT32 over Ada Lovelace; our efforts led Nvidia employee @mjoux to admit in the thread that the 2x applies only to limited instructions and is hard to achieve, and Nvidia themselves modified their whitepaper in version 1.1 with the disclaimer quoted earlier. All the old articles on websites like Tom’s Hardware still brag about 2x INT32 performance over Ada Lovelace, so Nvidia is not really being truthful in its presentations.

https://forums.developer.nvidia.com/t/blackwell-integer/320578

At the top of Nvidia’s Compare Graphics Cards page, it currently says "Streaming Multiprocessors: 2x FP32" for 30xx, 40xx, and 50xx vs "1x FP32" for 20xx, 16xx, and 10xx, but as this thread shows, that is just marketing speak to hype things up.

https://www.nvidia.com/en-us/geforce/graphics-cards/compare/

So we have weaker FP64 now than previously, the concurrent FP32 and INT32 of consumer 20xx and the data center Ampere A100 went away with later architectures, and the new Blackwell tries to sugarcoat its INT32 performance. Yes, there are new benefits like AI, etc., but a lot of performance is going away as the architectures advance…

From Yves Gallot on Mersenne Forum:

The ā€œdata sizeā€ of genefer is data + twiddle factors.

For n <= 2^21, the NTT is implemented with three 32-bit primes. The size of data is n * 4 * 3 and the size of twiddle factors is n * 4 * 3 / 2. If n = 2^21, we have n * 4 * 3 = 24 MB (read-write) + n * 4 * 3 / 2 = 12 MB (read-only) => data size: 36 MB.

For n >= 2^22, b is smaller and the NTT is implemented using two 32-bit primes. The size of data is n * 4 * 2 and the size of twiddle factors is n * 4 * 2 / 2. If n = 2^22, we have n * 4 * 2 = 32 MB (read-write) + n * 4 * 2 / 2 = 16 MB (read-only) => data size: 48 MB.

RTX 4060: L2 cache 24 MB, bandwidth 272.0 GB/s
RTX 2070: L2 cache 4 MB, bandwidth 448.0 GB/s
We can expect that "small" numbers (data size <= 24 MB) are faster on the 4060 and "big" numbers (data size > 24 MB) are faster on the 2070 (considering almost equal performance of the integer units).
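Yves’ data-size formulas can be captured in a small helper (a sketch; `genefer_data_size_mb` is a made-up name for illustration, not genefer’s actual code):

```python
# Sketch of the genefer "data size" formulas described above:
# data + twiddle factors, with the prime count switching at n = 2^21.
def genefer_data_size_mb(n):
    primes = 3 if n <= 2**21 else 2   # three 32-bit primes small, two large
    data = n * 4 * primes             # read-write transform data, bytes
    twiddles = data // 2              # read-only twiddle factors
    return (data + twiddles) / 2**20  # MiB

print(genefer_data_size_mb(2**21))  # 36.0
print(genefer_data_size_mb(2**22))  # 48.0
```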

One could agree with lots of your statements, but here you are not right:

But then Nvidia does it again and later consumer architectures (Ampere 30xx, Ada Lovelace 40xx, Blackwell 50xx) got rid of the concurrent FP32 and INT32 possible on every "CUDA core", so a step back for our PRPLL NTT case of 50% INT32 and 50% FP32 code.

Before (Turing) you had 64 INT32 + 64 FP32.

With Ampere you get 64 both + 64 FP32. That is clearly a speed-up, just not for your instruction mix. So no step back.

There are several reasons for the further diminished FP64 performance. One is that FP64 operations can now often be simulated faster with FP32, so native support on consumer cards is seldom used anymore.

In many cases Turing cards had better Tensor Cores than Ampere. And later on the performance of Tensor Cores was mostly increased with data center GPUs in recent architectures.

The 4060 has around 75% more transistors than the 2070 and faster clock speeds, but 33% fewer SMs. The transistors mostly went into the large L2 cache.

All this is publicly known. No deception from Nvidia there.

The 2x INT32 for Blackwell was a bit unclear. And in the end it was less of an advantage, not only because not all instructions can be combined, but also because the architectures before it were better than advertised.

@Curefab - as quoted by @Robert_Crovella in 2019 over here:

https://forums.developer.nvidia.com/t/is-it-possible-to-have-fp-unit-and-int-unit-in-a-same-core-work-in-parallel/71086

"with respect to 32-bit integer arithmetic, all current GPUs have dedicated integer add units. Kepler, Volta, and Turing have dedicated integer multiply units"

"NVIDIA certainly marketed simultaneous use of INT32 and FP32 cores in Volta."

"Unlike Pascal GPUs, which could not execute FP32 and INT32 instructions simultaneously, the Volta GV100 SM includes separate FP32 and INT32 cores, allowing simultaneous execution of FP32 and INT32 operations at full throughput, while also increasing instruction issue throughput."

Turing gets full throughput for both FP32 and INT32 simultaneously, so for something like the PRPLL NTT you can simply "combine", in a sense, the theoretical FP32 teraflops and INT32 tera integer operations per second for 50% FP32 / 50% INT32 code.

Yes, if code is 100% FP32 you can simply look at the theoretical FP32 Teraflops across architectures to compare.

The 64 dual-capability cores + 64 FP32 cores in consumer Ampere (30xx) that you describe are still not a speedup for FP32-only code; it is just a different architecture. It just means you can compare theoretical FP32 teraflop values across architectures like consumer Turing, Ampere, and Ada Lovelace. The architecture did change, but no, you cannot execute both FP32 and INT32 on each "CUDA core": on the half of the cores that support both FP32 and INT32, you have to choose one.

Contrast that with consumer Turing (RTX 20xx) and data center Ampere (A100): for 100% FP32-only code you can still compare theoretical FP32 teraflops across architectures, but these two are special in that each "CUDA core" can run both INT32 and FP32 at full throughput. So, as that TechPowerUp thread says, for 50% FP32 / 50% INT32 code it is almost as if the RTX 2070 has 2*2304 = 4608 processing cores (instead of its 2304 CUDA core value) compared to the 3072 processing cores of the RTX 4060. The half (64 out of every 128) of the "CUDA cores" on the RTX 4060 that support both INT32 and FP32 cannot use both at the same time, while consumer Turing and data center Ampere support this on all their cores.

See how Nvidia confuses things! You cannot compare CUDA cores across architectures, yet on their marketing pages they do; this is valid for FP32 only. Yes, for FP32-only code you can simply compare core counts, but the whole "2x FP32 per streaming multiprocessor" on their compare page is disingenuous. Yes, the Ada Lovelace RTX 4060 SM has 128 CUDA cores that all support FP32, while the Turing RTX 2070 SM has 64 CUDA cores that all support FP32, but that is just the design of a multiprocessor. Nvidia can’t contradict itself on its compare page: listing the CUDA core counts is perfectly fine and valid for comparing FP32 performance across architectures (after you factor in clock speed and a factor of 2 for FMA to get the theoretical FP32 teraflops), but no, there is no 2x speedup for FP32-only code; just the design of the GPU and of what a streaming multiprocessor is changed.

In my expert opinion it is a backwards step for the RTX 4060 to be slower than the RTX 2070, which is two generations older, at the PRPLL NTT; just like the weakening of FP64 over the years (as everyone knows, if you want strong FP64 you have to use data center GPUs), and like the RTX 4060 and 5060 having only 8GB of GPU memory for gaming while the 3060 had a 12GB version, etc.

Turing Whitepaper:

https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

The GeForce RTX 2080 Ti Founders Edition GPU delivers the following exceptional computational
performance:

  • 14.2 TFLOPS of peak single precision (FP32) performance
  • 14.2 TIPS concurrent with FP, through independent integer execution unit

First, the Turing SM adds a new independent integer datapath that can execute
instructions concurrently with the floating-point math datapath. In previous generations,
executing these instructions would have blocked floating-point instructions from issuing.

The Turing architecture features a new SM design that incorporates many of the features
introduced in our Volta GV100 SM architecture. Two SMs are included per TPC, and each SM has
a total of 64 FP32 Cores and 64 INT32 Cores.

The Turing SM supports concurrent execution of FP32 and
INT32 operations (more details below), independent thread scheduling similar to the Volta
GV100 GPU.

Turing implements a major revamping of the core execution datapaths. Modern shader
workloads typically have a mix of FP arithmetic instructions such as FADD or FMAD with simpler
instructions such as integer adds for addressing and fetching data, floating point compare or
min/max for processing results, etc. In previous shader architectures, the floating-point math
datapath sits idle whenever one of these non-FP-math instructions runs. Turing adds a second
parallel execution unit next to every CUDA core that executes these instructions in parallel with
floating point math.

It is a speedup for FP32 only (or heavy) code on Ampere (consumer).

Turing could execute 64 FP32 FMA instructions per cycle per SM, Ampere can execute 128. That is a doubling.

So the Ampere architecture is for any workload (INT32 or FP32) at least as good as Turing (in regards to execution units).

Due to the schedulers 128 is a maximum (4 partitions with 32 threads per warp).

With the 2070 (36) and the 4060 (24) you are comparing GPUs with different numbers of SMs.

The overall Volta architecture (introduced as Turing in consumer GPUs) stayed the same until now with Blackwell. It has only been tweaked here and there. The previous architectures (like Maxwell and Pascal) were much different - more complex and less efficient.

Some critique of yours is justified, but that the architecture got worse is just outright wrong.

Yeah, I am already in agreement with you: if you compare GPUs with the exact same number of SMs, yes, there is a doubling of FP32.

RTX 2070 has 2304 Cuda Cores (36 SMs)

The RTX 5060 Ti has double the CUDA cores, 4608 (36 SMs).

Keeping clock speed and everything else equal, yes, the 5060 Ti gets double the FP32-only performance, the same INT32-only performance, and the same performance on 50% FP32 + 50% INT32 code compared to the RTX 2070.

But GPUs are marketed by CUDA cores, so if the 5060 Ti had the same 2304 CUDA cores, it would have the same FP32-only performance, half the INT32-only performance, and half the performance on 50% FP32 + 50% INT32 code.

CUDA core comparisons are only valid within the same architecture, not across architectures. @Robert_Crovella, maybe Nvidia can add a footnote next to the CUDA Cores rows on the compare page here with this exact wording.

https://www.nvidia.com/en-us/geforce/graphics-cards/compare/

Your post Curefab below is true since instruction mix changes complicate comparisons.

"But a different instruction mix may have changed in performance across the architectures. So comparing based on measured FLOP numbers from workloads different from yours will complicate rather than simplify things.

So rather than relating to or predicting from those, or from the advertised 'CUDA cores' (which are more closely aligned with FP32 capability), use the frequency (decide: base or boost) and the number of SMs in the GPU as the sole factors for instruction performance.

This still excludes the mentioned effect of the memory system."


I got timings from some Mersenne Forum users including the user who owned the RTX 2070 (3.375MB L1, 4MB L2 cache) to compare against my desktop RTX 4060 (3MB L1 cache, 24MB L2 cache).

It is indeed true, as everyone alludes to: my 4060 is slightly faster than the 2070 at small data sizes, until the data size hits my L2 cache limit, after which the 2070 is slightly faster through the bigger data sizes due to its better memory bandwidth (448 GB/s vs 272 GB/s). Still, the timings are close throughout; the cards have similar performance in the PRPLL NTT. The RTX 5090 is the ultimate beast though: 7 times faster than my 4060 testing a world record 140 million exponent Mersenne number (over 9 hours for a one in a million chance of a prime; 4M integer NTT). Then for bigger data sizes that spill out of its huge L1 and L2 caches, that 7x ratio drops, probably because the memory bandwidth ratio between the two cards is smaller, "only" 6.59x (1792 GB/s vs 272 GB/s), LOL.

Is the reason the A100, with its awesome memory bandwidth (1555 GB/s), doesn’t have a bigger ratio over the 4060 once the 4060 spills out of its cache that the A100 doesn’t have enough operations per second to feed it?

Timings - Microseconds Per Iteration for PRPLL NTT

M31*M61 NTT   256K    512K    1M      2M      4M      8M      16M
Data Size     3MB     6MB     12MB    24MB    48MB    96MB    192MB
RTX 2070      124.6   201.3   385.8   797.2   1520.6  3006.1  6426
RTX 4060      115     169.6   320.6   823.3   1693.6  3451    7050.7
Ratio         0.92    0.84    0.83    1.03    1.11    1.15    1.10
A100 40GB     78.7    102.9   146.7   297.1   569.3   1095.8  2206
RTX 4060      115     169.6   320.6   823.3   1693.6  3451    7050.7
Ratio         1.46    1.65    2.19    2.77    2.97    3.15    3.20
RTX 5090      43.1    59.5    83.6    136.3   241.1   571.8   1215.9
RTX 4060      115     169.6   320.6   823.3   1693.6  3451    7050.7
Ratio         2.67    2.85    3.83    6.04    7.02    6.04    5.80
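A short script makes the cache-spill pattern in the table explicit (a sketch: the data size is taken as n * 12 bytes per the M31*M61 layout discussed above, ignoring twiddle factors; timings copied from the table):

```python
# Sketch: the measured 4060/2070 ratio crosses 1.0 right where the working
# set (n * 12 bytes: 4-byte M31 + 8-byte M61 residue per element, twiddles
# ignored) reaches the 4060's 24 MB L2 cache. Timings from the table above.
labels = ["256K", "512K", "1M", "2M", "4M", "8M", "16M"]
t_2070 = [124.6, 201.3, 385.8, 797.2, 1520.6, 3006.1, 6426.0]
t_4060 = [115.0, 169.6, 320.6, 823.3, 1693.6, 3451.0, 7050.7]

for i, label in enumerate(labels):
    n = 256 * 2**10 * 2**i         # transform length: 256K, 512K, ..., 16M
    mb = n * 12 / 2**20            # residue data size in MiB
    ratio = t_4060[i] / t_2070[i]  # > 1 means the 2070 is faster
    note = "at/above 4060 L2 (24 MB)" if mb >= 24 else "fits 4060 L2"
    print(f"{label}: {mb:.0f} MB, 4060/2070 ratio {ratio:.2f}, {note}")
```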

I have a question for any who can help with regards to Turing (RTX 2070).

At the Turing Tuning Guide here:

https://docs.nvidia.com/cuda/turing-tuning-guide/index.html

it states "Instructions are performed over two cycles, and the schedulers can issue independent instructions every cycle. Dependent instruction issue latency for core FMA math operations is four clock cycles, like Volta, compared to six cycles on Pascal." It also states "Similar to Volta, the Turing SM includes dedicated FP32 and INT32 cores. This enables simultaneous execution of FP32 and INT32 operations. Applications can interleave pointer arithmetic with floating-point computations. For example, each iteration of a pipelined loop could update addresses and load data for the next iteration while simultaneously processing the current iteration at full FP32 throughput."

Over at the Turing Whitepaper here:

https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

It states for the RTX 2080 Ti Founders Edition:

  • 14.2 TFLOPS of peak single precision (FP32) performance
  • 14.2 TIPS concurrent with FP, through independent integer execution units

Does 2 cycles rather than 1 cycle contradict the RTX 2080 Ti Founders Edition concurrent 14.2 TFLOPS + 14.2 TIPS peak performance?

Since Volta the SMs are divided into 4 partitions. Each warp (32 threads that ideally execute together with the same instruction pointer) resides on a fixed partition. A Cuda block (threads which can share shared memory) runs on one SM, but its warps may be (and usually are) distributed over the 4 partitions to balance resource use and increase the computation power available to each block.

If you distribute the 64 INT32 and the 64 FP32 execution units ('cores') of the RTX 2070 into the 4 partitions you get 16+16. So a warp with 32 threads needs 2 cycles to feed in (dispatch) a whole warp. The execution units are a pipeline, so several instructions are computed in different stages at the same time. The pipeline depth for those arithmetic instructions seems to be (according to your citations) 4 cycles. But for each execution unit, each cycle a new computation may start and another one may finish (regardless of the pipeline depth).

Each partition can schedule one instruction per cycle (there is one scheduler per partition). This counts only the starting of the instruction, so scheduling an instruction for 32 threads over 2 dispatch cycles still counts as one scheduling operation. While the second group of 16 threads is being dispatched, another instruction may be scheduled to the other execution unit type. So at maximum load, FP32 and INT32 instructions are scheduled and dispatched in an interleaved fashion.

These more detailed views don't affect the performance figures: they change neither the number of execution units nor their maximum throughput.

@Curefab - perfect, thanks so much for that explanation!

For everyone’s info, I posted Yves response - here it is.

Also I posted his note about FFT compared to NTT.

–

"are performed over two cycles" is not entirely clear, but I understand that latency is 2 cycles for INT32 instructions while throughput is 1 instruction per cycle. Latency is 4 cycles for FP32 FMA, and certainly about the same for IMAD.
Since peak performance is determined by throughput, not latency, there is no inconsistency.

–

The reason is that A100 is not memory limited.

For NTT size = 1M, the data (12 MB) fits in L2 and the timing is 146.7 µs. The timing for 16M is 2206 µs.
We have 2206/146.7 ~ 15 < 16.

For the RTX 4060, we have 320.6 µs at 1M and 7050.7 µs at 16M, so 7050.7/320.6 ~ 22. Because of the memory bandwidth limit, the RTX 4060 is 1 - 16/22 ≈ 27% slower than perfect scaling.

Another method to see it is to compute the real bandwidth.
For 16M, bytes per iteration = 16M * 12 * 4 * 2 (size * (64+32 bits) * 4 passes * read+write) = 1.5 GiB.
A100: 16M * 12 * 4 * 2 / 1024³ / 2206.0e-6 s = 680 GB/s
4060: 16M * 12 * 4 * 2 / 1024³ / 7050.7e-6 s = 213 GB/s

The RTX 4060 is limited by its memory bandwidth (272 GB/s), but the A100's memory bandwidth is more than it needs; 800 GB/s would be enough.

I think the RTX 5090 does not perform as well as expected compared to the RTX 4060 because of its L2 cache size.
Theoretical performance is 7x, but the L2 cache is only 4x larger. If the 5090's L2 cache were 168 MB (7x the 4060's 24 MB), the 16M ratio would be 7 and not 5.8.

–

But the code of PRPLL NTT is about 50% IMAD and 50% other INT32 instructions. IMAD is executed on the FP32 units.

–

The primary operation for a FFT is z = t * x + y, where x, y, z, t are FP64, which is a FMA.
The primary operation for a NTT is c = t * a + b mod n, where a, b, c, t are 64-bit integers and n = M61 (or 32-bit integers and n = M31).
c = t * a + b mod 2^61 - 1 requires about ten INT32 instructions.

FFT is expected to be faster if FP64:INT32 >= 1:8 and NTT is faster if FP64:INT32 <= 1:16.

–

Nov. 2 builds from George for anyone to play with.

Linux:

https://www.dropbox.com/scl/fi/e124vq5100znxbhbwuiqm/prpll?rlkey=1enkyc6qrlzshqlkdmkz2dqj3&dl=0

Windows:

https://www.dropbox.com/scl/fi/ui6nslguz9ici5v86fmqv/prpll-win.zip?rlkey=pkrxpmq5s7n1xcxhgrw66mkse&dl=0

Github:

https://github.com/gwoltman/gpuowl

Type prpll.exe -h for help. Start off by running prpll.exe -tune

Depending on whether the FP64 FFT or the integer NTT is faster, it will then create config.txt and tune.txt.

The prefix in front of the FFT spec, like the 1 in 1:256:2:256 that you see at prpll.exe -h, tells you the transform type. All can be used, and the tuning finds the best one for your specific GPU (Nvidia, AMD, Intel, etc.) for prefixes 0 through 4 as you increase the exponent size. Here they are, including hidden ones not fully developed:

0 or no prefix is old FP64 FFT (FFT64)
1 is M31+M61 NTT (FFT3161)
2 is FP32+M61 hybrid FFT. (FFT3261)
3 is M61 NTT. (FFT61)
4 is FFT323161
50 is FFT3231
51 is FFT6431
52 is FFT31
53 is FFT32 (FP32) - it's broken, but I see it works for exponent 1000003, for instance with 53:256:2:256 (256K). FP64 is preferred over FP32 because the bits per word for FP32 are too small for exponents of our size, so FP32 requires larger transform sizes.

Current wavefront testing is LL (double-check of old results) at 80 million exponents (prefix 1 with 2M has the best speed) and PRP (new primality check with proof and error protection) at 140 million exponents (prefix 1 with 4M has the best speed).

For the project https://www.mersenne.org/ you need to use AutoPrimeNet https://download.mersenne.ca/mirror/AutoPrimeNet/quick-start-guide.html to get exponents.