Integer NTT on RTX 20xx, A100 vs RTX 30xx, 40xx, 50xx

wfgarnett3 · November 5, 2025, 7:01pm

@Curefab - as quoted by @Robert_Crovella in 2019 over here:

https://forums.developer.nvidia.com/t/is-it-possible-to-have-fp-unit-and-int-unit-in-a-same-core-work-in-parallel/71086

“with respect to 32-bit integer arithmetic, all current GPUs have dedicated integer add units
Kepler, Volta, and Turing have dedicated integer multiply units”

“NVIDIA certainly marketed simultaneous use of INT32 and FP32 cores in Volta.”

“Unlike Pascal GPUs, which could not execute FP32 and INT32 instructions simultaneously, the Volta GV100 SM includes separate FP32 and INT32 cores, allowing simultaneous execution of FP32 and INT32 operations at full throughput, while also increasing instruction issue throughput.”

Turing gets full throughput for FP32 and INT32 simultaneously, so for code like the PRPLL NTT you can, in a sense, simply add the theoretical FP32 teraflops and the INT32 tera integer operations per second when the code is 50% FP32 / 50% INT32.

Yes, if code is 100% FP32 you can simply look at the theoretical FP32 Teraflops across architectures to compare.

The 64 both + 64 FP32 layout you mention for consumer Ampere (30xx) is still not a speedup for FP32-only code. It is just a different architecture, and it means you can compare theoretical FP32 teraflop values across consumer Turing, Ampere, Ada Lovelace, etc. The architecture did change, but no, you cannot execute FP32 and INT32 at the same time on the half of the “Cuda Cores” that support both: on each of those cores you have to choose one.

Contrast that with consumer Turing (RTX 20xx) and data center Ampere (A100). For 100% FP32 code you can still compare theoretical FP32 teraflops across architectures, but these two are special in that each “Cuda Core” can run both INT32 and FP32 at full throughput. So, as that TechPowerUp thread says, for 50% FP32 / 50% INT32 code it is almost as if the RTX 2070 has 2*2304=4608 processing cores (instead of its 2304 Cuda Cores), compared to the 3072 processing cores of the RTX 4060. The half (64 out of every 128) of the “Cuda Cores” on the RTX 4060 that support both INT32 and FP32 cannot do both at the same time, while consumer Turing and data center Ampere support that on all their cores.
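The 4608 vs 3072 comparison can be reproduced with a small lane-counting sketch. This is only a rough model under stated assumptions: it counts execution lanes per SM as described in this post, ignores clock speed, instruction issue limits, memory, and any difference in integer multiply throughput, and the function and variable names are made up for illustration.

```python
def mixed_ops_per_clock(fp_fraction, fp_only, int_only, shared):
    """Upper bound on ops per clock for one SM in a simple lane-counting model.

    fp_only / int_only: lanes that run only FP32 / only INT32.
    shared: lanes that run either FP32 or INT32, but only one at a time.
    """
    limits = [fp_only + int_only + shared]  # every lane busy
    if fp_fraction > 0:
        # FP32 work can only use the FP32-capable lanes.
        limits.append((fp_only + shared) / fp_fraction)
    if fp_fraction < 1:
        # INT32 work can only use the INT32-capable lanes.
        limits.append((int_only + shared) / (1 - fp_fraction))
    return min(limits)


# Per-SM layouts as described in the post (assumed, simplified).
TURING_SM = dict(fp_only=64, int_only=64, shared=0)  # RTX 20xx: 64 FP32 + 64 INT32
ADA_SM = dict(fp_only=64, int_only=0, shared=64)     # RTX 40xx: 64 FP32 + 64 "both"


def effective_cores(cuda_cores, cores_per_sm, sm_layout, fp_fraction):
    """Lanes that are usefully busy per clock across the whole GPU."""
    sm_count = cuda_cores // cores_per_sm
    return sm_count * mixed_ops_per_clock(fp_fraction, **sm_layout)


# 50% FP32 / 50% INT32 code, as in the NTT discussion.
print(effective_cores(2304, 64, TURING_SM, 0.5))  # RTX 2070
print(effective_cores(3072, 128, ADA_SM, 0.5))    # RTX 4060

# 100% FP32 code: the marketed Cuda Core counts are directly comparable.
print(effective_cores(2304, 64, TURING_SM, 1.0))
print(effective_cores(3072, 128, ADA_SM, 1.0))
```

In this model the RTX 2070 comes out at 4608 busy lanes for the 50/50 mix and 2304 for FP32-only code, while the RTX 4060 stays at 3072 in both cases, which is the point being made above.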

See how Nvidia confuses things! You cannot compare Cuda Cores across architectures in general, yet their marketing pages do; that comparison is valid for FP32 only. Yes, for FP32-only code you can simply compare core counts, but the whole 2x FP32 per streaming multiprocessor claim on their compare page is disingenuous. Yes, the streaming multiprocessor of the Ada Lovelace RTX 4060 has 128 Cuda Cores that all support FP32, while the Turing RTX 2070 has 64 Cuda Cores per SM that all support FP32, but that is just the design of a multiprocessor.

Nvidia can’t contradict itself on its compare page. If you want to list the Cuda Core counts like you did, that is perfectly fine and valid for comparing FP32 performance across architectures (after you factor in the clock speed in gigahertz and a factor of 2 for FMA to get the theoretical FP32 teraflops). But no, there is no 2x speedup for FP32-only code; only the design of the GPU, and of what a streaming multiprocessor is, changed.

In my expert opinion it is a backwards step for the RTX 4060 to be slower at PRPLL NTT than the RTX 2070, which is 2 generations older. It is just like the weakening of FP64 over the years (as everyone knows, if you want strong FP64 you have to use data center GPUs), or how the RTX 4060 and 5060 have only 8GB of GPU memory for gaming while the 3060 had a 12GB version, etc.

Turing Whitepaper:

https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

The GeForce RTX 2080 Ti Founders Edition GPU delivers the following exceptional computational
performance:

14.2 TFLOPS of peak single precision (FP32) performance
14.2 TIPS concurrent with FP, through independent integer execution unit
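The 14.2 TFLOPS figure follows from the formula mentioned earlier in this post (Cuda Cores times clock speed times 2 for FMA). A minimal sketch, assuming 4352 Cuda Cores and a 1635 MHz boost clock for the RTX 2080 Ti Founders Edition; those two numbers are not quoted in this post and come from the card's published specifications, and the function name is hypothetical:

```python
def peak_fp32_tflops(cuda_cores, boost_clock_mhz):
    """Theoretical FP32 peak: one FMA per core per clock counts as 2 FLOPs."""
    flops_per_second = cuda_cores * boost_clock_mhz * 1e6 * 2
    return flops_per_second / 1e12


# Assumed RTX 2080 Ti Founders Edition figures (not stated in the post).
tflops = peak_fp32_tflops(cuda_cores=4352, boost_clock_mhz=1635)
print(round(tflops, 1))  # about 14.2, matching the whitepaper figure
```

The same function can be used to compare FP32-only performance across architectures from the marketed core counts, which is the one case where that comparison holds.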

First, the Turing SM adds a new independent integer datapath that can execute
instructions concurrently with the floating-point math datapath. In previous generations,
executing these instructions would have blocked floating-point instructions from issuing.

The Turing architecture features a new SM design that incorporates many of the features
introduced in our Volta GV100 SM architecture. Two SMs are included per TPC, and each SM has
a total of 64 FP32 Cores and 64 INT32 Cores.

The Turing SM supports concurrent execution of FP32 and
INT32 operations (more details below), independent thread scheduling similar to the Volta
GV100 GPU.

Turing implements a major revamping of the core execution datapaths. Modern shader
workloads typically have a mix of FP arithmetic instructions such as FADD or FMAD with simpler
instructions such as integer adds for addressing and fetching data, floating point compare or
min/max for processing results, etc. In previous shader architectures, the floating-point math
datapath sits idle whenever one of these non-FP-math instructions runs. Turing adds a second
parallel execution unit next to every CUDA core that executes these instructions in parallel with
floating point math.
