Anyone using CUDA primarily for integer/logic work might be interested to see that all cores in an SM are now INT32-capable, for the first time since Pascal. We'll have to see whether any compromise has been made to fit them in.
"With the rise in the use of AI and integer use for such workloads, Nvidia has made all of the shader cores in Blackwell fully FP32/INT32 compatible. "
This would mean logic operations and integer addition and subtraction at full throughput. What about integer shifts and integer multiply(-add)? Full throughput, or performed at a reduced rate? Shifters and especially multipliers are where the additional cost becomes noticeable.
Yes, this was what I meant when I said “compromise”. We’ll have to wait and see. The quote from the article is “fully FP32/INT32 compatible”.
I’m a little confused. The table in your link shows the INT32 TOPS the same as the FP32, but as far as I can tell, there doesn’t seem to be an I32x2. There also don’t seem to be any CC 10.0+-specific integer (non-Tensor) instructions.
Unless the figure of “64 INT32 cores for integer math” should be 128, maybe the integer cores can process 2 instructions/cycle?
Something is wrong, and I believe it is the CC10.0 section as you stated.
Perhaps there are 64+64 mixed FP32/INT32 units, i.e. 16+16 per SM partition.
That means each warp needs 2 cycles and both can be fully used for either FP32 or INT32 or a mixture.
The question is whether the new f32x2 instructions merely reduce the number of instructions (kernel size / constant instruction cache pressure), whether they can be accelerated when both pipelines are available, or whether they require both pipelines to be available.
It could also be that a warp is put into the pipeline within 1 cycle, i.e. 32 FP32/INT32 instead of 16+16 anyway, then f32x2 would have no latency advantage. The architecture image shows the 32 units undivided.
Another data point in this, “Old man shouts at clouds” thread, is page 12 here:
"Note that the number of possible INT32 integer operations in Blackwell are doubled compared to Ada, by fully unifying them with FP32 cores, as depicted in Figure 6 below. However, the unified cores can only operate as either FP32 or INT32 cores in any given clock cycle."
That is about as clear as it can get: fully unified data path. Simplifies the instruction scheduling, but makes each individual functional unit more complex (the “all singing, all dancing” functional unit), which probably does not matter in a throughput-oriented design.
The natural consequence would be that IMAD and FP32 FMA have the same throughput, which opens up some interesting algorithmic choices. The only other processors I know of where IMUL and IMAD run at full throughput are lower-end scalar ARM microarchitectures like the Cortex-M4.
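One such choice is leaning on multiply-add chains that would otherwise be avoided. As a host-side C++ sketch (plain CPU code rather than device code; the loop body is the part that would map to a single IMAD per iteration on the GPU), a polynomial hash is essentially nothing but dependent multiply-adds:

```cpp
#include <cstddef>
#include <cstdint>

// Polynomial (Horner) hash: every iteration is one dependent 32-bit
// multiply-add, which a GPU backend can map to a single IMAD.
// With full-rate IMAD this chain costs no more than a chain of IADDs.
uint32_t poly_hash(const uint8_t* data, size_t n, uint32_t base = 31u) {
    uint32_t h = 0;
    for (size_t i = 0; i < n; ++i)
        h = h * base + data[i];  // one multiply-add per byte
    return h;
}
```

On architectures where IMUL runs at 1/4 rate, this loop bottlenecks on the multiplier; with unified full-rate units it does not.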
Hi Norbert,
why is full IMUL/IMAD integer through-put so seldom? Because of 32 instead of 23+1 mantissa bits for the multiplication?
Does anybody know, whether those are two pipelines with dedicated data ports starting the computation for 16 threads per cycle or one pipeline for all 32 threads of a warp?
I have not been involved with the design of processors since the year 2000, so the following is just a rough outline.
If you look at the floor plan of an FPU datapath, that hulking square in the middle is the multiplier, taking up a large portion of the datapath and accounting for most of its power draw. These days, that huge block is actually an FMA unit most of the time. Fast integer multipliers (say, 3-4 cycle latency) are just as large as corresponding FP multipliers and, needing a few more bits (32 vs 24 for single precision, 64 vs 53 for double), possibly a bit larger.
If one builds separate integer and FP32 datapaths, the integer multipliers are costly in terms of silicon real estate, and since the percentage of integer multiplies in most code is fairly small thanks to compiler optimizations (strength reduction) and LEA-type instructions for addressing computations, going easy on the integer multipliers saves transistors to be used for other features. So one winds up with IMULs at 1/3 or 1/4 the throughput of simple integer arithmetic instructions.
A unified INT32 / FP32 functional unit may increase the size by maybe 25% over a pure FP32 approach, but no area and transistors are needed to build an entirely separate datapath for INT32. Whether the unified approach is actually a net win likely depends on the state of silicon technology and implementation methodology, and different choices may result over time.
In terms of more commonly used operations, having full throughput IMAD would benefit integer division and modulo operations in CUDA, which are all emulated via loads of IMULs and IMADs under the hood. In recent years fast IDIV implementations appear to have become popular among CPU makers and presumably for good reasons. Apple’s ARM CPUs in particular have been remarked on positively in the media in this regard, so it would be good for GPUs to keep up even without the dedicated division hardware deployed in x86 and ARM processors.
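For the curious, a host-side C++ sketch of the simplest case of that emulation, division by a compile-time constant: it becomes a multiply-high plus shift (the magic constant 0xCCCCCCCD with a total shift of 34, i.e. mulhi then >> 2, is the standard pair for divisor 5, per Hacker's Delight). On a GPU this lowers to IMUL.HI/IMAD-style instructions instead of any divide hardware:

```cpp
#include <cstdint>

// Unsigned division by the constant 5 via multiply-high: the widened
// product's top bits, shifted down, yield the exact quotient for all
// 32-bit inputs. Runtime-divisor division is emulated similarly, with
// more IMUL/IMAD steps to compute the reciprocal first.
uint32_t div5(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDull) >> 34);
}
```

The faster IMAD runs, the cheaper all of this emulated division becomes.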
Beyond that, having the cost of an IMAD and an IADD basically the same offers some interesting algorithmic choices.
According to the Blackwell whitepaper, INT32 TOPS:
RTX 4090 – 128 SMs – 41.3 TOPS
RTX 5090 – 170 SMs – 104.8 TOPS
Curiously, the Ada whitepaper specifically qualified TOPS as IMAD; the Blackwell whitepaper does not state this.
I hope they are still measuring IMAD…
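For what it's worth, those figures are consistent with cores × 2 (counting IMAD as two ops) × boost clock, if we assume the published boost clocks (my assumption: 2.52 GHz for the RTX 4090, 2.407 GHz for the RTX 5090) and INT32 issue on half the cores per SM for Ada but all of them for Blackwell:

```cpp
// Back-of-envelope check of the whitepaper TOPS figures.
// Assumed boost clocks: RTX 4090 = 2.52 GHz, RTX 5090 = 2.407 GHz.
// Ada: 64 INT32-capable cores of 128 per SM; Blackwell: all 128.
double tops(int sms, int int32_cores_per_sm, double boost_ghz) {
    return sms * int32_cores_per_sm * 2.0 * boost_ghz / 1000.0;  // TOPS
}
// tops(128, 64, 2.52)   -> ~41.3 TOPS  (RTX 4090)
// tops(170, 128, 2.407) -> ~104.8 TOPS (RTX 5090)
```

So the Blackwell number is at least arithmetically consistent with IMAD at full rate on all cores.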
So you’ll be at the cache rate, not at the main memory rate. That is still memory bound, treating the cache as part of the memory subsystem. It’s not going to work well to try to measure integer throughput that way. Making things even worse, you have 3 memory operations (two LDG, one STG) per integer op (IADD).
People here in this forum would love to see the output from deviceQuery if you can manage it, on that 5080.
As a throughput calculation example, let's take the RTX 6000 Ada. The non-tensor FP32 rate is 91 TFLOPS and the INT32 rate is 44 TOPS, roughly consistent with the 2:1 ratio mentioned already here and elsewhere for Ada. If we used your kernel to attempt to measure, or get close to, that rate, then at 44 TOPS we would need 3 additional memory ops (2 LDG, 1 STG) per integer op. So the observed rate of those global load/store instructions would need to reach 132 T ops/s. Since each one moves 4 bytes, that converts to over 500 TB/s. However, the memory bandwidth of the RTX 6000 Ada is only around 1 TB/s. The L2 cache bandwidth is not published by NVIDIA, but it is not 500x higher than the global memory bandwidth; a measurement on an RTX 4090 showed an L2-to-global-memory bandwidth ratio closer to 5x than 500x. So the kernel code you have shown will proceed at (best) the rate of the L2 bandwidth, not at anything representative of the integer throughput.
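The arithmetic in that paragraph, spelled out as a trivial sketch:

```cpp
// Bandwidth needed to sustain a given INT32 rate when every integer op
// is accompanied by memory operations (here: 2 LDG + 1 STG, 4 B each).
double required_tbps(double int_tops, double mem_ops_per_int_op,
                     double bytes_per_mem_op) {
    return int_tops * mem_ops_per_int_op * bytes_per_mem_op;  // TB/s
}
// required_tbps(44, 3, 4) -> 528 TB/s, versus roughly 1 TB/s of DRAM
// bandwidth on the RTX 6000 Ada.
```

That ~500x gap is why a load/store-heavy kernel can only ever measure the memory hierarchy, not the ALUs.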