Question about 64 Bit Integer Performance

euphoricpoptarts · August 15, 2018, 5:19am

Hello,

I am currently trying to decide if it would be worth investing in a workstation GPU. I am building an application that calculates large digits of pi, and due to the nature of the problem my code must make use of C/C++'s unsigned long long type (the 52 bit mantissa of a double can not represent the integers I need).
I’ve been unable to find any resources that describe the throughput of any Nvidia GPUs when working with integers of any kind, let alone 64 bit unsigned integers.

I am currently using 2 Titan X (Pascal) cards to run computations, although I had originally purchased them for a gaming PC. This link (Comparison Between NVIDIA GeForce and Tesla GPUs) states that a Tesla V100 has about 20x the double precision performance of an individual Titan X, but I have no idea if that at all translates to 64 bit int performance.

Any information about what the performance for 64 bit ints looks like on the Tesla V100 vs the Titan X would help me a lot.

njuffa · August 15, 2018, 6:29am

With the exception of converting between 64-bit integer types and floating-point types, the double-precision units in the GPU have no bearing on 64-bit integer performance.

Generally speaking, GPUs are 32-bit machines with 64-bit addressing capability. All 64-bit integer operations are emulated. In the simplest case (logical operations, additions, subtractions) each 64-bit integer operation requires the execution of two 32-bit integer instructions. Very roughly, emulation of 64-bit multiplication requires about twenty 32-bit instructions, 64-bit division requires about eighty 32-bit instructions.

To first order, you can assume simple 32-bit integer operations to have the same throughput as 32-bit floating-point operations. This isn’t necessarily so for every GPU architecture and every 32-bit integer operation, but it should be a good enough approximation for back-of-the-envelope calculations. The CUDA documentation provides more detailed data on the throughput of various operations, including integer ones:

[url]Programming Guide :: CUDA Toolkit Documentation

euphoricpoptarts · August 15, 2018, 9:58pm

Thanks for the reply. So from what you’ve said, is it correct that the ratio between two devices’ relative 64-bit integer performance would be equal to the ratio between their 32-bit integer performance? Also, what does “multiple instructions” mean in the table you linked?

njuffa · August 15, 2018, 10:10pm

When the two devices are based on the same GPU architecture, and you are using the same CUDA version on both: yes. If you are using different CUDA versions, or different architectures: approximately yes. Why approximately? Because available hardware instructions can and do change between GPU architectures (there is no binary compatibility, as there is on x86 processors) and emulation sequences can change just like any other software.

The literal meaning of “multiple instructions” is applicable here: more than one instruction. In other words, there is no single hardware instruction that implements the particular operation. It is emulated using an instruction sequence comprising more than one instruction.

How many instructions exactly, and which particular instructions, may depend on the architecture and the CUDA version, therefore the manual cannot tell you what the throughput is. But you could write yourself a micro-benchmark to measure it for the particular architecture and CUDA version you are interested in.

euphoricpoptarts · August 15, 2018, 10:20pm

That makes sense. I might try to find a cloud server with a Tesla V100 in order to do some of the micro-benchmarking you mentioned. If it isn’t what I’m hoping for, I’ll probably wait to see what the Turing architecture has to offer.

njuffa · August 15, 2018, 10:32pm

Part of the efficiency of GPUs is that they are 32-bit architectures with a fairly minimal instruction set. This is very unlikely to change, as Moore’s Law has come to its end, and transistors are needed for special accelerator functionality.

Therefore one should expect that 64-bit integer operations will remain as emulated operations. Note that in terms of throughput of 64-bit operations, a high-end GPU will still run rings around a 64-bit CPU, even though emulation is used. This is why integer-bound projects like prime number searches make heavy use of GPUs, and have done so for years.

Please note that record holder Alex Yee claims that GPUs are not suitable for record-setting PI computations because of communication overhead and memory bandwidth issues: [url]http://www.numberworld.org/y-cruncher/faq.html[/url]. I have not worked on PI computation since the 1980s, so I will take his word for it.

euphoricpoptarts · August 15, 2018, 10:38pm

Yes, I am aware that sequential calculation of ALL the digits of pi up to a desired target is not well-suited for GPUs. However, my program does not implement the chudnovsky algorithm nor is it intended for the same purpose. My program is intended to calculate several hex digits of pi beginning from a desired hex digit, using the Bailey-Borwein-Plouffe formula, which has no such memory-bandwidth limitations.

njuffa · August 16, 2018, 2:34am

I already suspected you might use the BBP formula, but the problem I see there is: Where is the inherent parallelism that allows for the utilization of 20,0000 threads?

euphoricpoptarts · August 16, 2018, 2:45am

The parallelism comes from the fact that each term of the summation in the formula is completely independent. You can have several thousand to millions of threads stride over the summation, compute a particular term, and add it to a thread specific running sum. Then you can reduce the sums found by all the threads to a singular result. The threads never have to communicate, each thread need only know how many other threads there are. The hardest part really is the modular exponentiation required in each term of the summation, because 64 bit modulus is very slow.

njuffa · August 16, 2018, 3:32am

At minimum, the reductions involve inter-thread communication. As I said, I haven’t looked into details of PI computations in a long time. When I played with them in the early 1980s, I was still using arctangent formulas. My memory is hazy, but I think my first version was written in BASIC!

I would assume that for modular exponentiation, you are using Montgomery multiplication, which costs basically three multiplies and some cleanup code; no built-in modulo operations involved.

euphoricpoptarts · August 16, 2018, 4:06am

Yeah, the reductions of course requiring some thread synchronization, but the great thing is that the reduction step only needs to be performed once at the very end after all the terms of the BBP summation have been collected into buckets by each thread. In fact, the reduction step is by far the shortest step, and its time complexity isn’t dependent upon the digit being calculated, only the number of threads in use.

I actually hadn’t heard of Montgomery multiplication until now, I’ve been using left-to-right binary exponentiation, and a few tricks to reduce the number of modulus needed. I’m gonna read over that link more, I want to understand the concept rather than just copy and paste it.

I’ve also never had the chance to program in BASIC, and it was only briefly mentioned in my college programming languages course. The language itself is actually older than both of my parents!

njuffa · August 16, 2018, 7:17am

That’s definitely the best way to tackle technical problems. As I recall, Montgomery multiplication isn’t exactly trivial conceptually and you may need to consult multiple sources to find one that explains it well. But I see that Google brings up myriad articles and lecture notes when I search for “Montgomery multiplication”, so people with just about any background should be able to find something that works for them.

As a programming language BASIC was very … basic and really a pretty terrible way to learn programming, inviting newbies to write the worst kind of spaghetti code. But with computers based on 4 MHz 8-bit processors with a whopping 16KB of system memory, that was pretty much the upper limit of what one could accomplish.

euphoricpoptarts · August 18, 2018, 12:34am

I looked through a few sources about montgomery multiplication I found through google, and I think I have a good grasp on it now. I used the code from this link (http://www.hackersdelight.org/hdcodetxt/mont64.c.txt), and modified it slightly to use some 64-bit multipliers I made in PTX.

The new version of the program runs in just 52-60% of the runtime of the previous version, closer to 50% the larger the digit being calculated. Thank you for telling me about this, I doubt I would’ve come across it without you!

Topic		Replies	Views
Forward looking GPU integer performance CUDA Programming and Performance	22	22314	March 20, 2017
How much speed of 64bit integer algebra in the latest GPUs? CUDA Programming and Performance	2	2127	April 21, 2014
the superior 680 / 690 gpu how many cycles is 32 x 32 == 64 bits integer CUDA Programming and Performance	4	3349	May 2, 2012
Cuda 3.5 Integer Multiply Performance Is it really 3x slower than 64-bit floating point? CUDA Programming and Performance	21	20315	March 12, 2014
Are operations of add, sum, multiplication and division equivalent in performance regardless of the number of bits? CUDA Programming and Performance	5	1091	August 4, 2023
How to tell if GPU cores are actually 32/64-bit processors CUDA Programming and Performance	11	13226	February 16, 2017
TITAN V / Tesla async 64-bit core CUDA Programming and Performance	7	960	January 4, 2018
int64 support and speed? CUDA Programming and Performance	8	11293	September 1, 2010
estimate 64bit integer instruction throughput CUDA Programming and Performance	4	962	September 29, 2018
Peak Performance of integer operation CUDA Programming and Performance	3	2970	May 11, 2017

Question about 64 Bit Integer Performance

Related topics