Cuda 3.5 Integer Multiply Performance Is it really 3x slower than 64-bit floating point?

I was looking through the cuda 5 preview documentation, and the updated table of arithmetic instructions throughput shows that the to-be-released compute capability 3.5 cards’ multiprocessors will have a 32-bit integer throughput of only 32 per clock cycle. Is there any chance this is an error? Amazingly, the reported 64-bit floating point operations’ throughput is 64 per clock cycle. Does 64-bit floating point multiplication really have triple the throughput of 32-bit integer multiplication?

When CUDA 4.2 was first released, there were some errors in the table in the Programming Guide that lists the operation throughput. We double checked everything and an updated Programming Guide was posted. sm_30 and sm_35 have identical 32-bit IMUL / IMAD throughput, and 32 operations per cycle is the correct throughput as far as I am aware. Since double precision math is performed by a different HW unit, there is no connection between the instruction throughput for DP multiplies and integer multiplies.

The sm_3x throughput ratio of IMUL to IADD appears to be on par with modern x86 CPUs ( lists reciprocal throughput of ADD and MUL as 0.33 and 2 on SandyBridge, giving a throughput ratio of 1:6), so 32-bit IMUL on sm_3x doesn’t seem to be particularly underprovisioned. Is your question motivated by concern for the performance of a particular application or class of applications?

A possible hack on GK110 is to perform 32-bit integer mul/mad in the 64-bit floating point units since doubles have plenty of exact integer precision.

I’m using GPUs to work in the area of number theory and cryptography, where integer math is used quite heavily. I have a background in digital logic / computer hardware design, which is why it strikes me as odd that (significantly more complicated) 64-bit floating point operations are cheaper than 32-bit integer ones, however I see now that this is pretty typical for x86 ALUs.

For my applications, which rely very heavily on IMUL and IMAD, it would seem that the Kepler architecture is actually a step back from Fermi in terms of IMAD/IMUL throughput:

The GTX 480 (Fermi) which I am currently using has 15 SMs with an IMAD/IMUL throughput of 16 / clock. With a clock running at 1.4 GHz, I can estimate my maximum IMAD/IMUL throughput at 15 x 16 x 1.4 = 336 giga IMAD/MUL per second. The GTX 680 (Kepler), on the other hand, has 8 SMX with an IMAD/IMUL throughput of 32, and a clock rate of about 1 GHz giving: 8 x 32 x 1 = 256 giga IMAD/MUL per second.

I appreciate that the Kepler cards have a whole bunch of other improvements, and I am especially looking forward to the new GK110s and their increased maximum number of registers per thread among other features, but I am disappointed that raw arithmetic performance for my applications might actually go down.

Ah, I’ve considered using this, but my problem is that I’m using large (at least several hundred bit) integers. I also need to use operations with carry-in and carry-out. A 64-bit float gives you 53 bits of precision, and doesn’t have any convenient mechanism to handle carry propagation; this leaves me with two problems: 1 - I’ll need more words to hold my values than, say, normal 64-bit integers, and 2 - I’d need some sort of way to determine if I’ve got a carry in/out or an overflow has occurred. I can’t think of any ways to do this that take less time than just using the slower IMAD/IMUL instructions.

Now that a sufficiently large body of real-life CUDA applications exists, architectural decisions are also informed by the instruction usage of those applications, in particular to create efficient designs with high perf/mm2 and high perf/watt. The high DP performance of GK110 is clearly required for its intended market. You are correct that 32-bit integer multipliers are not as expensive as DP floating-point multipliers, but they do require additional area.

As allanmac points out, double-precision denormals can be used for integer arithmetic up to 52 bits since our GPUs handle denormals at full speed, although admittedly this is a bit hacky. I am aware of the heavy use of integer multiply (actually, integer multiply-add) in certain crypto algorithms such as RSA, and prime number searches, but was wondering whether there are other application areas with similar workloads. Since the exact specifications for GK110-based products are not available yet, I think it is too early to conclude that your application level performance on GK110 will decline from GTX480 levels.

Ah, I see. It’s fun to think about how to exploit the 192 32-bit floats/clock though – 6x over 32-bit integer throughput.

Maybe one of the classic large number multiplication algorithms can be modified to use Kepler’s extreme FP capacity?

Good luck!

Yes. Random number generation is often limited by the speed of integer ops… shifts, xors, muls, and adds.

One of my major long term research areas is designing a fast and high quality PRNG for use specifically in CUDA.

Even now, sm 1.1, 1.3, and 2.0 each have completely different efficiencies, enough that it makes a significant sense to use a different PRNG algorithm for each, (and yet another PRNG for CPU, where we can take advantage of very fast 64 bit integer mults). Multiplies are particularly useful for “bit mixing” steps, since they propagate details in low order bits into complex updates of the high order bits. But the converse is not true, so you usually need >> shifting to move entropy the other way. (Though in SM 2.0, the __brev() and __byte_perm() intrinsics are even better.)

I have not yet gotten any SM 3.0 hardware, but I’m sure I’ll need to optimize yet another PRNG for its own characteristics.

The very same kind of integer mixing math is also used for hash tables. You might imagine managing your own hash table to look for data collisions (or cache values) and need to randomly map an integer into one of your (limited) bins. Perhaps each bin is an index into a shared memory array, for example. Then often you’ll form a hash with a small compute… a typical example that somewhat randomly but very efficiently maps any integer into an index from 0 to 1023 inclusive would be something like the expression (0x63452361*keyValue)>>22 . Better hashes usually use even more multiplies and mixing… but notice even that simplest example is dominated by that integer multiply. And non-power-of-2 cache sizes are even trickier in CUDA since the integer modulus operator must be avoided at all costs for efficiency reasons.

But finally, probably the most common overhead of integer multiplies is small but ubiquitous… index computation. You might explicitly compute int tid = blockIdx.x*blockDim.x+threadIdx.x in the majority of kernels you write, and while that multiply isn’t a major bottleneck, it might amortize over all CUDA kernels to be significant since it’s so common. And even array indices like x[i] require an integer add-and-multiply.

Given that GK110 will have 64 FP64 units per each of the 15 SMX we could estimate DP performance to be somewhere in the range of 64 * 15 * 2 * (0.7 <-> 1 ) Ghz => 1344 <-> 1920 DP GFLOPS. I guess utilizing these for integer operations could give a significant boost over 500 series GPU.

Btw, won’t the maximum int32 multiply-add performance of GK110 be 15 * 32 * 2 * (0.7 <-> 1) => 672 <-> 960 giga integer operations per second ?

I worked with the CURAND team for a bit and am aware of the tradeoffs between integer multiplication and DP multiplication there. I am also aware of efficient mixing functions with good avalanching behavior, but none of the ones I have used were based on integer multiplies, rather they were all based on logical operations plus shifts, or on DES-like SBOXes. I’d be interested in pointers to relevant papers about integer multiply-based mixing functions.

The use of integer multiplies in indexing is common and well understood and therefore very likely is sufficiently represented by the various CUDA applications the architecture team tracks. Note that in many cases multiplies present in the source code disappear due to compiler optimizations (unrolling, strength reduction and creation of induction variables). Some integer multiplies in indexing also map to left shift or left-shift-add instructions.

Application performance is usually influenced by many factors, rarely is it limited by the performance of a single operation. Looking at overall CUDA application performance, I would be surprised if any except a few that are known to be bound by integer multiplication throughput, such as RSA encryption or prime number search by trial division, would see a significant impact from the different IADD/IMUL throughput ratio on Kepler vs Fermi. Once GK110-based GPUs ship, it will be interesting to see what kind of real-life application-level performance is observed, and I will certainly watch out for any trends that may become apparent.

Since I provided the integer division implementations for Fermi (both 32-bit and 64-bit versions): As far as I can tell, the throughput ratio of integer division (and by extension, integer modulo, which is just a few instructions more) to integer add on Fermi compares favorably to modern x86 CPUs. Integer division is a fairly slow operation on any modern CPU platform I have worked with (x86, ARM, SPARC, PowerPC). Therefore issues with slow division tend to be universal, rather than specific to CUDA. The CUDA compiler implements the common optimizations for integer division and modulo by constants. Performance of the current implementation of integer division / modulo on Fermi and Kepler is strongly correlated with the throughput of integer multiplication, which dominates the relevant code sequences.

Can I ask you which document you are referring exactly to?

CUDA_C_Programming_Guide.pdf, Section 5.4.1, Table 2. (At least in the CUDA 5.0 version of the document.)

Ah good, thanks man ;)

Hi Jim, I think your math is right but can you tell me where is that multiply by 2 come from? Thanks

The “2” is from fused multiply accumulate ( 2 ops / cc ).

I know the thread is a bit old, by I am currently facing problems in the transition from Fermi to Kepler as kleboeuf.
I work on cryptographic/cryptanalytic applications and I have a large implementation for Fermi that I am now benchmarking on Kepler. I played with block and grid sizes but the performance is always almost twice as slow (GTX 580 vs GTX 680).
This makes perfect sense if one looks at arithmetic instruction throughput, as I am basically doing only 32-bit integer mul and muladd operations.
I do not care much about how Kepler compares to Intel or AMD CPUs but if it is confirmed that Kepler
is unfortunately a step back from Fermi for crypto apps.
I would like to know if someone tried a solution like using 32-bit or 64-bit floating point to implement large integer arithmetic.
I thought of using 32-bit floating point mul and add (to exploit their high throughput) to implement large integer addition and multiplication, but handling carries becomes ugly and in addition each 32-bit integer digit must be now split in 2 or 3 fp digits.
So considering the cost of conversion, plus carry handling, plus the increase in the number of digits by a factor of 2-3,
what speed-up could we expect (if any)?

I cannot really second this for my crypto related application.

cudaminer has gained significantly for various hashing algorithms on the Kepler architecture (scrypt, keccak) over Fermi. It just took a while (and help from third parties) before I was able to unlock this potential.

The funnel shifter feature on Compute 3.5 helps speeding up ROL and ROR 32 and 64 bit rotates that are often used in crypto. The transition to Maxwell even further improved the situation by adding a lot more power efficiency.

But I am not doing large integer multiplications in my code, so your situation may differ.


Note that GTX680 is sm_30, not sm_35. I am aware that scaling for integer-multiply heavy code from GTX580 to GTX680 is not that great, although I am a bit surprised you report half the performance: that is worse than what I recall (I do not work hands-on with sm_30). I am wondering whether there are other bottlenecks in the code.

In practical terms, you would want to make sure that the long-integer multiplication is performed as efficiently as possible. In my experience, this means hand-coding the primitives using inline PTX, and using the widest primitives possible instead of going with narrow limbs. I realize that there are pretty tight limits due to at most 63 registers being available per thread. You should be able to use 256-bit primitives per thread, though. I have 256 x 256 bit multiplication code that I could post if it would help. [Later:] I see I already posted it to the forums, here:

When building long-integer arithmetic from floating-point operations, one historical scheme is to split operands into 24-bit “digits” which are stored in floats, then load these, convert them to double precision and finally use a double-precision multiply accumulate. The five extra bits allow deferred carry propagation when partial products are accumulated column-wise. sm_30 has slow double precision, so I do not see how that would help. Right now I don’t have any idea how one could utilize single-precision FMA effectively for this application, maybe someone else has an idea.

[added later:] For rough guidance on relative integer multiply performance for a real application that is limited by integer multiply performance, I have in the past used this summary of mfaktc benchmark results. It shows the GTX 680 at 56% the performance of the GTX 580, which jibes with vonneumann’s observations:

Hi all,
and thanks for your replies.
To answer cbuchner1 let me just specify that I am dealing with public key cryptanalysis mostly,
but also public key crypto like batch RSA.
In the code I mentioned, I deal with variable size large integers (at most 400 bits) and do modular arithmetic.
I have PTX code for my arithmetic routines with massive use of integer muladd( and
my modular multiplication performance is close to the ideal peak performance (e.g., the achieved arithmetic instruction throughput is very close to the ideal throughput imposed by device limits).
In practice I have versions of multi-precision modular arithmetic for different limb sizes (again between 64 and 400 bits).
I plan to release my code in the future, but I cannot do it now.
The main problem for me remains whether the conversion of my arithmetic to a single-precision fp version is an avenue worth walking through or I had better wait for Maxwell/Volta…

Thinking out loud here: Have you considered using 16-bit “digits”, each stored in a float? This would allow the computation of a double-wide product with the help of FFMA, with plenty of bits left over accumulation of partial products in each column allowing carry propagation to be deferred until the very end of the multiplication. The core code might look something like this:

hi = truncf (__fmul_rz (__fmul_rz (a[i], b[j]), 1.5258789e-5f));
lo = fmaf (a[i], b[j], -hi * 65536.0f);
col_sum_hi += hi;
col_sum_lo += lo;

I am not entirely sure that this computation of “hi” works as intended, and I haven’t tested it.