Question about 64 Bit Integer Performance


I am currently trying to decide if it would be worth investing in a workstation GPU. I am building an application that calculates large digits of pi, and due to the nature of the problem my code must make use of C/C++'s unsigned long long type (the 52 bit mantissa of a double can not represent the integers I need).
I’ve been unable to find any resources that describe the throughput of any Nvidia GPUs when working with integers of any kind, let alone 64 bit unsigned integers.

I am currently using 2 Titan X (Pascal) cards to run computations, although I had originally purchased them for a gaming PC. This link ( states that a Tesla V100 has about 20x the double precision performance of an individual Titan X, but I have no idea if that at all translates to 64 bit int performance.

Any information about what the performance for 64 bit ints looks like on the Tesla V100 vs the Titan X would help me a lot.

With the exception of converting between 64-bit integer types and floating-point types, the double-precision units in the GPU have no bearing on 64-bit integer performance.

Generally speaking, GPUs are 32-bit machines with 64-bit addressing capability. All 64-bit integer operations are emulated. In the simplest case (logical operations, additions, subtractions) each 64-bit integer operation requires the execution of two 32-bit integer instructions. Very roughly, emulation of 64-bit multiplication requires about twenty 32-bit instructions, 64-bit division requires about eighty 32-bit instructions.

To first order, you can assume simple 32-bit integer operations to have the same throughput as 32-bit floating-point operations. This isn’t necessarily so for every GPU architecture and every 32-bit integer operation, but it should be a good enough approximation for back-of-the-envelope calculations. The CUDA documentation provides more detailed data on the throughput of various operations, including integer ones:

Thanks for the reply. So from what you’ve said, is it correct that the ratio between two devices’ relative 64-bit integer performance would be equal to the ratio between their 32-bit integer performance? Also, what does “multiple instructions” mean in the table you linked?

When the two devices are based on the same GPU architecture, and you are using the same CUDA version on both: yes. If you are using different CUDA versions, or different architectures: approximately yes. Why approximately? Because available hardware instructions can and do change between GPU architectures (there is no binary compatibility, as there is on x86 processors) and emulation sequences can change just like any other software.

The literal meaning of “multiple instructions” is applicable here: more than one instruction. In other words, there is no single hardware instruction that implements the particular operation. It is emulated using an instruction sequence comprising more than one instruction.

How many instructions exactly, and which particular instructions, may depend on the architecture and the CUDA version, therefore the manual cannot tell you what the throughput is. But you could write yourself a micro-benchmark to measure it for the particular architecture and CUDA version you are interested in.

That makes sense. I might try to find a cloud server with a Tesla V100 in order to do some of the micro-benchmarking you mentioned. If it isn’t what I’m hoping for, I’ll probably wait to see what the Turing architecture has to offer.

Part of the efficiency of GPUs is that they are 32-bit architectures with a fairly minimal instruction set. This is very unlikely to change, as Moore’s Law has come to its end, and transistors are needed for special accelerator functionality.

Therefore one should expect that 64-bit integer operations will remain as emulated operations. Note that in terms of throughput of 64-bit operations, a high-end GPU will still run rings around a 64-bit CPU, even though emulation is used. This is why integer-bound projects like prime number searches make heavy use of GPUs, and have done so for years.

Please note that record holder Alex Yee claims that GPUs are not suitable for record-setting PI computations because of communication overhead and memory bandwidth issues: I have not worked on PI computation since the 1980s, so I will take his word for it.

Yes, I am aware that sequential calculation of ALL the digits of pi up to a desired target is not well-suited for GPUs. However, my program does not implement the chudnovsky algorithm nor is it intended for the same purpose. My program is intended to calculate several hex digits of pi beginning from a desired hex digit, using the Bailey-Borwein-Plouffe formula, which has no such memory-bandwidth limitations.

I already suspected you might use the BBP formula, but the problem I see there is: Where is the inherent parallelism that allows for the utilization of 20,0000 threads?

The parallelism comes from the fact that each term of the summation in the formula is completely independent. You can have several thousand to millions of threads stride over the summation, compute a particular term, and add it to a thread specific running sum. Then you can reduce the sums found by all the threads to a singular result. The threads never have to communicate, each thread need only know how many other threads there are. The hardest part really is the modular exponentiation required in each term of the summation, because 64 bit modulus is very slow.

At minimum, the reductions involve inter-thread communication. As I said, I haven’t looked into details of PI computations in a long time. When I played with them in the early 1980s, I was still using arctangent formulas. My memory is hazy, but I think my first version was written in BASIC!

I would assume that for modular exponentiation, you are using Montgomery multiplication, which costs basically three multiplies and some cleanup code; no built-in modulo operations involved.

Yeah, the reductions of course requiring some thread synchronization, but the great thing is that the reduction step only needs to be performed once at the very end after all the terms of the BBP summation have been collected into buckets by each thread. In fact, the reduction step is by far the shortest step, and its time complexity isn’t dependent upon the digit being calculated, only the number of threads in use.

I actually hadn’t heard of Montgomery multiplication until now, I’ve been using left-to-right binary exponentiation, and a few tricks to reduce the number of modulus needed. I’m gonna read over that link more, I want to understand the concept rather than just copy and paste it.

I’ve also never had the chance to program in BASIC, and it was only briefly mentioned in my college programming languages course. The language itself is actually older than both of my parents!

That’s definitely the best way to tackle technical problems. As I recall, Montgomery multiplication isn’t exactly trivial conceptually and you may need to consult multiple sources to find one that explains it well. But I see that Google brings up myriad articles and lecture notes when I search for “Montgomery multiplication”, so people with just about any background should be able to find something that works for them.

As a programming language BASIC was very … basic and really a pretty terrible way to learn programming, inviting newbies to write the worst kind of spaghetti code. But with computers based on 4 MHz 8-bit processors with a whopping 16KB of system memory, that was pretty much the upper limit of what one could accomplish.

I looked through a few sources about montgomery multiplication I found through google, and I think I have a good grasp on it now. I used the code from this link (, and modified it slightly to use some 64-bit multipliers I made in PTX.

The new version of the program runs in just 52-60% of the runtime of the previous version, closer to 50% the larger the digit being calculated. Thank you for telling me about this, I doubt I would’ve come across it without you!