Anyone using CUDA primarily for integer/logic work might be interested to see that all cores in an SM are now INT32-capable, for the first time since Pascal. We'll have to see whether any compromise has been made to fit them in.
"With the rise in the use of AI and integer use for such workloads, Nvidia has made all of the shader cores in Blackwell fully FP32/INT32 compatible. "
This would mean logic operations and integer addition and subtraction at full throughput. What about integer shifts and integer multiply(-add)? Full throughput, or performed at a reduced rate? Shifters and especially multipliers are where the additional cost becomes noticeable.
Yes, this was what I meant when I said “compromise”. We’ll have to wait and see. The quote from the article is “fully FP32/INT32 compatible”.
I’m a little confused. The table in your link shows the INT32 TOPS the same as the FP32, but as far as I can tell, there doesn’t seem to be an I32x2. There also don’t seem to be any CC 10.0+-specific integer (non-Tensor) instructions.
Unless the figure of “64 INT32 cores for integer math” should be 128, maybe the integer cores can process 2 instructions/cycle?
Something is wrong, and I believe it is the CC10.0 section as you stated.
Perhaps there are 64+64 mixed FP32/INT32 units, i.e. 16+16 per SM partition.
That means each warp needs 2 cycles and both can be fully used for either FP32 or INT32 or a mixture.
The question is whether the new f32x2 instructions merely reduce the number of instructions (kernel size / constant instruction cache pressure), whether they can be accelerated when both pipelines are available, or whether they require both pipelines to be available.
It could also be that a warp is put into the pipeline within 1 cycle, i.e. 32 FP32/INT32 instead of 16+16 anyway, then f32x2 would have no latency advantage. The architecture image shows the 32 units undivided.
Another data point in this, “Old man shouts at clouds” thread, is page 12 here:
"Note that the number of possible INT32 integer operations in Blackwell are doubled compared to Ada, by fully unifying them with FP32 cores, as depicted in Figure 6 below. However, the unified cores can only operate as either FP32 or INT32 cores in any given clock cycle."
That is about as clear as it can get: fully unified data path. Simplifies the instruction scheduling, but makes each individual functional unit more complex (the “all singing, all dancing” functional unit), which probably does not matter in a throughput-oriented design.
The natural consequence would be that IMAD and FP32 FMA have the same throughput, which opens up some interesting algorithmic choices. The only other processors I know of where IMUL and IMAD run at full throughput are lower-end scalar ARM microarchitectures like the Cortex-M4.
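One such choice is leaning on multiply-add chains that would otherwise be avoided. As a host-side C++ sketch (plain CPU code rather than device code; the loop body is the part that would map to a single IMAD per iteration on the GPU), a polynomial hash is essentially nothing but dependent multiply-adds:

```cpp
#include <cstddef>
#include <cstdint>

// Polynomial (Horner) hash: every iteration is one dependent 32-bit
// multiply-add, which a GPU backend can map to a single IMAD.
// With full-rate IMAD this chain costs no more than a chain of IADDs.
uint32_t poly_hash(const uint8_t* data, size_t n, uint32_t base = 31u) {
    uint32_t h = 0;
    for (size_t i = 0; i < n; ++i)
        h = h * base + data[i];  // one multiply-add per byte
    return h;
}
```

On architectures where IMUL runs at 1/4 rate, this loop bottlenecks on the multiplier; with unified full-rate units it does not.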
Hi Norbert,
why is full IMUL/IMAD integer through-put so seldom? Because of 32 instead of 23+1 mantissa bits for the multiplication?
Does anybody know, whether those are two pipelines with dedicated data ports starting the computation for 16 threads per cycle or one pipeline for all 32 threads of a warp?
I have not been involved with the design of processors since the year 2000, so the following is just a rough outline.
If you look at the floor plan of an FPU datapath, that hulking square in the middle is the multiplier, taking up a large portion of the datapath and accounting for most of its power draw. These days, that huge block is actually an FMA unit most of the time. Fast integer multipliers (say, 3-4 cycle latency) are just as large as corresponding FP multipliers and, needing a few more bits (32 vs 24 for single precision, 64 vs 53 for double), possibly a bit larger.
If one builds separate integer and FP32 datapaths, the integer multipliers are costly in terms of silicon real estate, and since the percentage of integer multiplies in most code is fairly small thanks to compiler optimizations (strength reduction) and LEA-type instructions for addressing computations, going easy on the integer multipliers saves transistors to be used for other features. So one winds up with IMULs at 1/3 or 1/4 the throughput of simple integer arithmetic instructions.
A unified INT32 / FP32 functional unit may increase the size by maybe 25% over a pure FP32 approach, but no area and transistors are needed to build an entirely separate datapath for INT32. Whether the unified approach is actually a net win likely depends on the state of silicon technology and implementation methodology, and different choices may result over time.
In terms of more commonly used operations, having full throughput IMAD would benefit integer division and modulo operations in CUDA, which are all emulated via loads of IMULs and IMADs under the hood. In recent years fast IDIV implementations appear to have become popular among CPU makers and presumably for good reasons. Apple’s ARM CPUs in particular have been remarked on positively in the media in this regard, so it would be good for GPUs to keep up even without the dedicated division hardware deployed in x86 and ARM processors.
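For the curious, a host-side C++ sketch of the simplest case of that emulation, division by a compile-time constant: it becomes a multiply-high plus shift (the magic constant 0xCCCCCCCD with a total shift of 34, i.e. mulhi then >> 2, is the standard pair for divisor 5, per Hacker's Delight). On a GPU this lowers to IMUL.HI/IMAD-style instructions instead of any divide hardware:

```cpp
#include <cstdint>

// Unsigned division by the constant 5 via multiply-high: the widened
// product's top bits, shifted down, yield the exact quotient for all
// 32-bit inputs. Runtime-divisor division is emulated similarly, with
// more IMUL/IMAD steps to compute the reciprocal first.
uint32_t div5(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDull) >> 34);
}
```

The faster IMAD runs, the cheaper all of this emulated division becomes.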
Beyond that, having the cost of an IMAD and an IADD basically the same offers some interesting algorithmic choices.
According to the Blackwell whitepaper, INT32 TOPS:
RTX 4090 – 128 SMs – 41.3 TOPS
RTX 5090 – 170 SMs – 104.8 TOPS
Curiously, the Ada whitepaper specifically qualified TOPS as IMAD; the Blackwell whitepaper does not state this.
I hope they are still measuring IMAD…
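For what it's worth, those figures are consistent with cores × 2 (counting IMAD as two ops) × boost clock, if we assume the published boost clocks (my assumption: 2.52 GHz for the RTX 4090, 2.407 GHz for the RTX 5090) and INT32 issue on half the cores per SM for Ada but all of them for Blackwell:

```cpp
// Back-of-envelope check of the whitepaper TOPS figures.
// Assumed boost clocks: RTX 4090 = 2.52 GHz, RTX 5090 = 2.407 GHz.
// Ada: 64 INT32-capable cores of 128 per SM; Blackwell: all 128.
double tops(int sms, int int32_cores_per_sm, double boost_ghz) {
    return sms * int32_cores_per_sm * 2.0 * boost_ghz / 1000.0;  // TOPS
}
// tops(128, 64, 2.52)   -> ~41.3 TOPS  (RTX 4090)
// tops(170, 128, 2.407) -> ~104.8 TOPS (RTX 5090)
```

So the Blackwell number is at least arithmetically consistent with IMAD at full rate on all cores.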
So you’ll be at the cache rate, not at the main memory rate. That is still memory bound, treating the cache as part of the memory subsystem. It’s not going to work well to try to measure integer throughput that way. Making things even worse, you have 3 memory operations (two LDG, one STG) per integer op (IADD).
People here in this forum would love to see the output from deviceQuery if you can manage it, on that 5080.
As a throughput calculation example, let's take the RTX 6000 Ada. The non-tensor FP32 rate is 91 TFLOPS and the INT32 rate is 44 TOPS, roughly consistent with the 2:1 ratio mentioned already here and elsewhere for Ada. If we used your kernel to attempt to measure, or get close to, that rate, then at 44 TOPS we would need 3 additional memory ops (2 LDG, 1 STG) per integer op. So the observed rate of those global load/store instructions would need to reach 132 T ops/s. Since each one moves 4 bytes, that converts to over 500 TB/s. However, the memory bandwidth of the RTX 6000 Ada is only around 1 TB/s. The L2 cache bandwidth is not published by NVIDIA, but it is not 500x higher than the global memory bandwidth; a measurement on an RTX 4090 showed an L2-to-global-memory bandwidth ratio closer to 5x than 500x. So the kernel code you have shown will proceed at (best) the rate of the L2 bandwidth, not at anything representative of the integer throughput.
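The arithmetic in that paragraph, spelled out as a trivial sketch:

```cpp
// Bandwidth needed to sustain a given INT32 rate when every integer op
// is accompanied by memory operations (here: 2 LDG + 1 STG, 4 B each).
double required_tbps(double int_tops, double mem_ops_per_int_op,
                     double bytes_per_mem_op) {
    return int_tops * mem_ops_per_int_op * bytes_per_mem_op;  // TB/s
}
// required_tbps(44, 3, 4) -> 528 TB/s, versus roughly 1 TB/s of DRAM
// bandwidth on the RTX 6000 Ada.
```

That ~500x gap is why a load/store-heavy kernel can only ever measure the memory hierarchy, not the ALUs.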