An indication going the other way can be seen further up the thread, where I ran Norbert’s first benchmark of mixed integer instructions on a GTX 1060 with 10 SMs.
Pascal had 128 INT32 cores/SM, and I got a score of 2.02 Tiops/sec.
Norbert’s run on the next-gen Turing RTX 4000, with 30 SMs and 64 INT32 cores/SM, got 2.24 Tiops/sec, although he does mention a possible extenuating factor, not deemed that significant.
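For a sanity check on those numbers, the rough theoretical peak each card is being measured against is just SMs x INT32 lanes per SM x clock. A minimal sketch (the boost clocks below are approximate and assumed only for illustration; they are not from this thread):

#include <cstdio>

// Rough theoretical peak INT32 throughput: SMs * INT32 lanes per SM * clock (GHz).
// One integer op per lane per cycle is assumed; the boost clocks are approximate guesses.
static double peak_tiops(int sms, int int32_lanes_per_sm, double clock_ghz) {
    return sms * int32_lanes_per_sm * clock_ghz / 1000.0;  // Tiops/s
}

int main() {
    // GTX 1060 (Pascal): 10 SMs, 128 unified INT32/FP32 lanes per SM, ~1.7 GHz boost.
    printf("GTX 1060 peak: ~%.2f Tiops/s (measured ~2.02 in the thread)\n",
           peak_tiops(10, 128, 1.7));
    // Turing RTX 4000 as described above: 30 SMs, 64 dedicated INT32 lanes per SM, ~1.5 GHz.
    printf("RTX 4000 peak: ~%.2f Tiops/s (measured ~2.24 in the thread)\n",
           peak_tiops(30, 64, 1.5));
    return 0;
}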
It is not only in the whitepaper, but also in the SM diagram. Yes, all those diagrams are slightly simplified, but it is more than a quick sentence somewhere.
deviceQuery output for a 5070 Ti, FWIW. Apologies if the 5070 Ti is off topic from INT32; I landed here just trying to make vLLM work.
CUDA Device Query (Runtime API)
Detected 1 CUDA Capable device(s)
Device 0: “NVIDIA Graphics Device”
CUDA Driver Version / Runtime Version 12.8 / 12.8
CUDA Capability Major/Minor version number: 12.0
Total amount of global memory: 15851 MBytes (16620847104 bytes)
MapSMtoCores for SM 12.0 is undefined. Default to use 128 Cores/SM
MapSMtoCores for SM 12.0 is undefined. Default to use 128 Cores/SM
(70) Multiprocessors, (128) CUDA Cores/MP: 8960 CUDA Cores
GPU Max Clock rate: 2588 MHz (2.59 GHz)
Memory Clock rate: 14001 MHz
Memory Bus Width: 256-bit
L2 Cache Size: 50331648 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.8, CUDA Runtime Version = 12.8, NumDevs = 1, Device0 = NVIDIA Graphics Device
Result = PASS
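Regarding the two “MapSMtoCores … undefined” lines above: that warning comes from the cores-per-SM lookup table used by the deviceQuery sample (the _ConvertSMVer2Cores helper in helper_cuda.h), which simply has no entry for SM 12.0 yet and falls back to 128. A minimal sketch of that pattern, with a hypothetical {0xc0, 128} entry added (128 happens to match the fallback, so the reported 8960 CUDA cores would be unchanged):

#include <cstdio>

// Sketch patterned after _ConvertSMVer2Cores() in the CUDA samples' helper_cuda.h.
// The {0xc0, 128} entry for compute capability 12.0 is an assumption added here;
// without it, the "MapSMtoCores ... undefined" warning above is printed instead.
inline int ConvertSMVer2Cores(int major, int minor) {
    struct SMtoCores { int sm; int cores; };  // sm = 0xMm, major/minor as hex nibbles
    const SMtoCores table[] = {
        {0x60, 64}, {0x61, 128}, {0x62, 128},  // Pascal
        {0x70, 64}, {0x75, 64},                // Volta, Turing
        {0x80, 64}, {0x86, 128}, {0x89, 128},  // Ampere, Ada Lovelace
        {0x90, 128},                           // Hopper
        {0xc0, 128},                           // assumed: consumer Blackwell, SM 12.0
        {-1, -1}};
    const int sm = (major << 4) + minor;
    for (int i = 0; table[i].sm != -1; ++i)
        if (table[i].sm == sm) return table[i].cores;
    printf("MapSMtoCores for SM %d.%d is undefined. Default to use 128 Cores/SM\n",
           major, minor);
    return 128;
}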
Somehow the roll-out of this new architecture does not seem to be well organized (cf. the confusion around the compute capability number: 10.0 as officially published, but apparently really 12.0), which I find curious given the vast resources the company has at its disposal.
Maybe the “real roll-out” was originally scheduled for GTC, March 17? One can hope.
10.0 is Enterprise Blackwell, probably including the announced Digits workstation.
So one asks: why is there a version gap within one architecture?
Four possible reasons:
Nvidia intends to create more enterprise-specific or non-upwards-compatible features and wants to clearly distinguish the two lines
It was a last-minute decision to use those specific SMs, and they actually are not Blackwell
Enterprise and consumer are diverging more and more, especially in the Tensor Core architecture, which doubles with every enterprise generation and often stagnates for consumers
The major version should be aligned between the architecture and the toolkit version, and Nvidia did not care that this decision split their devices within the Blackwell generation
That’s what I figured after it was pointed out to me today that RTX 5090 has CC 12.0, although NVIDIA’s website currently incorrectly states CC 10.0 (a fix is in the works, I am told).
While these hypotheses seem plausible, I would expect more from NVIDIA than to create this (mild) chaos. The company is doing exceedingly well and is able to hire the very best employees the labor market has to offer. Plus, it has been shipping CUDA for 18 years; the release process should function like clockwork by now.
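Incidentally, for anyone who just wants to see what their own driver and runtime report, rather than relying on the published table, the minimal check is cudaGetDeviceProperties; a small sketch:

#include <cstdio>
#include <cuda_runtime.h>

// Print the compute capability the CUDA runtime reports for each installed device.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}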
Up till now, that was just a minor version difference.
E.g. GK210 in the Tesla K80 - 128 KiB instead of 64 KiB L1, 128K instead of 64K registers, 64 FP64 → CC 3.7
Or GP100 - HBM memory: CC 6.0
Or Volta vs. Turing: CC 7.0 vs. 7.5
Enterprise Blackwell is special: it has 256 KiB of Tensor Memory per SM. That is a lot, and no other generation has this feature, at least not exposed.
On March 20, I put in an Nvidia support ticket regarding pages 11-12 of the Nvidia RTX Blackwell GPU Architecture document.
I asked “Please advise for consumer RTX 5000 series graphics cards if they do or do not have double INT32 performance (i.e. can use all shaders for INT32 rather than half) for Blackwell consumer Geforce 5000 series graphics cards compared to previous Ada Lovelace consumer Geforce 4000 series graphics (which can use half the shaders for INT32)”.
On April 9 Nvidia replied “There is no error in the Blackwell architecture document. Yes, the consumer GeForce RTX 5000 series graphics cards do have double INT32 performance compared to the Ada Lovelace-based RTX 4000 series”
If I understand the information in the two linked PrimeGrid forum threads correctly, the posters there have not been able to demonstrate this claimed 2x performance advantage when comparing the 4000 series and 5000 series.
If so, we would appear to be pretty much where we were at the start of this thread in terms of information.
Even though my AIDA64 Extreme (trial version) GPGPU benchmark screenshot at PrimeGrid shows my RTX 4060 (Ada Lovelace) with single-precision FLOPS at twice the value of 32-bit integer IOPS (which, architecture-wise, would agree with the Ada Lovelace architecture document), the creator (Yves Gallot) of one of the programs used to find world-record prime numbers states the below. If I am paraphrasing correctly, it basically says that consumer Ada Lovelace ALREADY CAN do IOPS = FLOPS (rather than FLOPS being twice IOPS) by executing a MAD instruction on half the Ada Lovelace shaders and other instructions (add, shift, logical operations, etc.) on the other half. The way he words it below also aligns with the maximum instruction throughput section of the CUDA documentation for the RTX 4060 (Compute Capability 8.9).
------Quoting Yves:
Compute Capability 8.6 (GeForce 30): SM = 64 MAD32_64/FP32 + 64 INT32/FP32 + 2 FP64. Half of the cores are able to execute a MAD instruction z += x * y, where x and y are 32-bit integers, and the result of the multiplication and z are 64-bit integers. The other half of the cores execute other instructions (add, shift, logical operations, etc.).
Compute Capability 8.9 (GeForce 40): SMs are identical to 8.6, but the process size is 5 nm (Ampere was 8 nm), so the GPU operates at a higher frequency. More importantly, the L2 cache size is 10x: the 40x0 cards are at least 50% faster than the 30x0 cards.
For Nvidia GPUs there is no 24-bit integer, so 24-bit IOPS = 32-bit IOPS. For AMD GPUs, a 24-bit multiplication is three times as fast as a 32-bit multiplication, and there is no MAD instruction, only a MUL operation.
FLOPS is twice the number of FP instructions per second because of FMA.
How is IOPS defined? A priori x2 for IMAD, but there is also a 3-input integer addition (IADD3), which is two additions (the genefer code is a list of IMAD and IADD3, plus some tests and conditional moves for modular addition/subtraction).
We should have:
40 series: 2 cores can execute 2 FMA or 1 IADD3 and 1 IMAD => 4 FLOP or 4 IOP. Then FLOPS = IOPS.
50 series: each core can execute 1 FMA or 1 IADD3 or 1 IMAD => 2 FLOP or 2 IOP. Then FLOPS = IOPS.
There is no 64-bit integer unit, so a 64-bit addition is two 32-bit additions and a 64-bit MAD is four 32-bit MADs. Note that every instruction supports carry-in and carry-out.
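To make the instruction mix Yves describes a bit more concrete, here is a minimal CUDA sketch of that kind of inner loop (not genefer’s actual code): a widening 32x32 -> 64-bit multiply-add next to a 3-input add. Whether the compiler really emits IMAD/IMAD.WIDE and IADD3 for these lines is an assumption about code generation that would need to be checked in the SASS.

// Minimal sketch of a genefer-like instruction mix (not the actual genefer code).
// The expectation (an assumption, to be verified in the generated SASS) is that the
// first statement in the loop lowers to widening IMADs with carry and the second to IADD3.
__global__ void mixed_int_kernel(const unsigned* x, const unsigned* y,
                                 unsigned* out, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned a = x[i], b = y[i], c = (unsigned)i;
    unsigned long long z = 0;
    for (int k = 0; k < iters; ++k) {
        z += (unsigned long long)a * b;  // 32x32 -> 64 MAD (two 32-bit MADs with carry)
        c = a + b + c;                   // 3-input add (candidate for IADD3)
        a ^= (unsigned)z;                // keep a data dependency so nothing is optimized away
    }
    out[i] = c ^ (unsigned)(z >> 32);
}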
njuffa, the posts in the PrimeGrid thread are correct, and INT32 performance is the same between Blackwell and Ada Lovelace, because the genefer software on Ada Lovelace issues 1 IADD3 AND 1 IMAD at the same time (across 2 cores), negating any “theoretical” 2x increase for Blackwell, where each core can execute 1 of either instruction.
Curefab, where in this thread does it say that INT32 benchmarks for Blackwell are HALF as fast as FP32?
Also, for INT32, looking at Yves’s quote below: when instructions are “individual” and not “mixed”, is this thread saying the supposed 2x INT32 performance increase isn’t happening?
-----Quoting Yves:
It is both true and false.
If the number of cores is n, Ada Lovelace can execute n/2 IADD3 instructions per cycle and Blackwell n IADD3 instructions. Ada Lovelace can execute n/2 IMAD and Blackwell n IMAD. But Ada Lovelace can execute n/2 IADD3 and n/2 IMAD per cycle in the same way as Blackwell.
If a benchmark evaluates instructions individually then INT32 performance is 2x. But in practice, where instructions are mixed, INT32 performance is 1x.
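A hedged sketch of the distinction Yves draws, as two toy kernels (assuming the compiler emits IADD3/IMAD for these lines, and that enough warps are resident to hide latency): timing the first kernel in isolation should show roughly 2x on Blackwell versus Ada Lovelace, while the second, with adds and multiply-adds interleaved like real code, should show roughly parity.

// "Individual" workload: nothing but dependent 3-input adds.
// Per Yves's model, Ada Lovelace retires n/2 IADD3 per cycle and Blackwell n,
// so this kernel alone should run roughly 2x faster on Blackwell.
__global__ void only_adds(unsigned* out, unsigned seed, int iters) {
    unsigned a = seed + threadIdx.x, b = seed ^ blockIdx.x, c = 1;
    for (int k = 0; k < iters; ++k) {
        a = a + b + c;
        b = b + c + a;
        c = c + a + b;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a ^ b ^ c;
}

// "Mixed" workload: adds and multiply-adds interleaved, closer to genefer-style code.
// Ada Lovelace can co-issue n/2 IADD3 + n/2 IMAD per cycle, the same total as Blackwell,
// so the wider Blackwell INT32 path should give roughly no advantage here.
__global__ void mixed_ops(unsigned* out, unsigned seed, int iters) {
    unsigned a = seed + threadIdx.x, b = seed ^ blockIdx.x, c = 1;
    for (int k = 0; k < iters; ++k) {
        a = a + b + c;   // candidate IADD3
        c = c * b + a;   // candidate IMAD
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a ^ c;
}

Timing would be done per kernel with cudaEvent timers over a large grid; the absolute numbers matter less than the ratio between the two kernels on each card.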