Integer NTT on RTX 20xx, A100 vs RTX 30xx, 40xx, 50xx

@Curefab - as quoted by @Robert_Crovella in 2019 over here:

https://forums.developer.nvidia.com/t/is-it-possible-to-have-fp-unit-and-int-unit-in-a-same-core-work-in-parallel/71086

“with respect to 32-bit integer arithmetic, all current GPUs have dedicated integer add units
Kepler, Volta, and Turing have dedicated integer multiply units”

“NVIDIA certainly marketed simultaneous use of INT32 and FP32 cores in Volta.”

“Unlike Pascal GPUs, which could not execute FP32 and INT32 instructions simultaneously, the Volta GV100 SM includes separate FP32 and INT32 cores, allowing simultaneous execution of FP32 and INT32 operations at full throughput, while also increasing instruction issue throughput.”

Turing gets full throughput for FP32 and INT32 simultaneously, so for code that is 50% FP32 / 50% INT32, as in the PRPLL NTT, you can in a sense simply “combine” the theoretical FP32 teraflops and the INT32 tera integer operations per second.

Yes, if the code is 100% FP32, you can simply compare the theoretical FP32 teraflops across architectures.

The 64 FP32/INT32 + 64 FP32-only layout you mention for consumer Ampere (30xx) is still not a speedup for FP32-only code. It is just a different architecture, and it means you can compare theoretical FP32 teraflop values across architectures such as consumer Turing, Ampere, and Ada Lovelace. The architecture did change, but no, you cannot execute FP32 and INT32 at the same time on the half of the “Cuda Cores” that support both: on each of those cores you have to choose one.

Contrast that with consumer Turing (RTX 20xx) and data center Ampere (A100). For 100% FP32 code you can still compare theoretical FP32 teraflops across architectures, but these two are special in that each “Cuda Core” can run both INT32 and FP32 at full throughput. So, as that TechPowerUp thread says, for 50% FP32 / 50% INT32 code it is almost as if the RTX 2070 had 2*2304=4608 processing cores (instead of its 2304 Cuda Cores), compared to the 3072 processing cores of the RTX 4060. The half (64 out of every 128) of the “Cuda Cores” on the RTX 4060 that support both INT32 and FP32 cannot run both at the same time on a given core, while consumer Turing and data center Ampere support this on all their cores.
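To make the 4608 vs 3072 comparison concrete, here is a rough per-clock issue model in Python. It is only a sketch: the names `SM` and `peak_ops_per_clock` are made up for illustration, the lane layouts are the ones described in this post (64 FP32 + 64 INT32 per Turing SM, 64 FP32-only + 64 FP32-or-INT32 per Ada SM), and it ignores clock speed, memory bandwidth, and scheduling details entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SM:
    fp32_only: int   # lanes that can issue FP32 only
    int32_only: int  # lanes that can issue INT32 only
    shared: int      # lanes that can issue FP32 or INT32, but not both in one clock


def peak_ops_per_clock(sm: SM, sm_count: int, int_fraction: float) -> float:
    """Upper bound on instructions per clock for a given INT32 share of the mix."""
    total = sm.fp32_only + sm.int32_only + sm.shared
    limits = [float(total)]
    if int_fraction > 0:
        # INT32 work can only use the INT32-only and shared lanes.
        limits.append((sm.int32_only + sm.shared) / int_fraction)
    if int_fraction < 1:
        # FP32 work can only use the FP32-only and shared lanes.
        limits.append((sm.fp32_only + sm.shared) / (1 - int_fraction))
    return sm_count * min(limits)


# Lane layouts as described in the post (simplified assumptions).
TURING = SM(fp32_only=64, int32_only=64, shared=0)
ADA = SM(fp32_only=64, int32_only=0, shared=64)

RTX_2070_SMS = 2304 // 64    # 2304 Cuda Cores at 64 per SM
RTX_4060_SMS = 3072 // 128   # 3072 Cuda Cores at 128 per SM

for label, f in (("100% FP32", 0.0), ("50% FP32 / 50% INT32", 0.5)):
    t = peak_ops_per_clock(TURING, RTX_2070_SMS, f)
    a = peak_ops_per_clock(ADA, RTX_4060_SMS, f)
    print(f"{label}: RTX 2070 {t:.0f}/clock, RTX 4060 {a:.0f}/clock")
```

Under these assumptions the model gives 2304 vs 3072 instructions per clock for FP32-only code and 4608 vs 3072 for the half-and-half mix, which is the comparison made above. Since clocks are left out, it is not a prediction of actual run times.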

See how Nvidia confuses things! You cannot compare Cuda Cores across architectures in general, yet their marketing pages do exactly that, which is valid for FP32 only. Yes, for FP32-only code you can simply compare core counts, but the whole 2x FP32 per streaming multiprocessor on their compare page is disingenuous. Yes, the SM of the Ada Lovelace RTX 4060 has 128 Cuda Cores that all support FP32, while the SM of the Turing RTX 2070 has 64 Cuda Cores that all support FP32, but that is just the design of the multiprocessor. Nvidia can’t contradict itself on its compare page. If you want to list the Cuda Core counts like you did, that is perfectly fine and valid for comparing FP32 performance across architectures (after you factor in the clock speed in gigahertz and a factor of 2 for FMA to get the theoretical FP32 teraflops). But no, there is no 2x speedup for FP32-only code; only the design of the GPU, and of what a streaming multiprocessor is, changed.

In my expert opinion it is a backwards step for the RTX 4060 to be slower for PRPLL NTT than the RTX 2070 from two generations earlier. It is just like the weakening of FP64 over the years (as everyone knows, if you want strong FP64 you have to use data center GPUs), or how the RTX 4060 and 5060 have only 8GB of GPU memory for gaming while the 3060 had a 12GB version.

Turing Whitepaper:

https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

The GeForce RTX 2080 Ti Founders Edition GPU delivers the following exceptional computational
performance:

  • 14.2 TFLOPS of peak single precision (FP32) performance
  • 14.2 TIPS concurrent with FP, through independent integer execution unit

First, the Turing SM adds a new independent integer datapath that can execute
instructions concurrently with the floating-point math datapath. In previous generations,
executing these instructions would have blocked floating-point instructions from issuing.

The Turing architecture features a new SM design that incorporates many of the features
introduced in our Volta GV100 SM architecture. Two SMs are included per TPC, and each SM has
a total of 64 FP32 Cores and 64 INT32 Cores.

The Turing SM supports concurrent execution of FP32 and
INT32 operations (more details below), independent thread scheduling similar to the Volta
GV100 GPU.

Turing implements a major revamping of the core execution datapaths. Modern shader
workloads typically have a mix of FP arithmetic instructions such as FADD or FMAD with simpler
instructions such as integer adds for addressing and fetching data, floating point compare or
min/max for processing results, etc. In previous shader architectures, the floating-point math
datapath sits idle whenever one of these non-FP-math instructions runs. Turing adds a second
parallel execution unit next to every CUDA core that executes these instructions in parallel with
floating point math.