Forward looking GPU integer performance

I am looking at the possibility of moving certain types of integer-based, game-related simulation code to the GPU, so I have been looking at integer performance across a range of NVIDIA GPUs.

From my googling, integer performance still seems to lag significantly behind float32 performance, despite some sources claiming that the two are equal these days (which I have yet to find proof of). I ran the CUDA-Z test across a range of GPUs and got the following relevant results:

                   GTX1080   GTX980Ti   GTX970   GTX680
float32 GFLOP/s       8450       7285     3387     2525
int64 GIOP/s           606        489      248      142
int32 GIOP/s          2840       2457     1005      567

I am not sure how good a benchmark CUDA-Z is, but the results don’t surprise me.

Some of what I am looking to move to the GPU uses 64-bit integers on the CPU, as that code benefits greatly from wide integers. It appears that even on the GTX1080 64-bit integers are emulated with 32-bit ones, so would I be better off manually reorganising around more complicated 32-bit integer implementations for more control?

I imagine this is a bit of a chicken-and-egg situation going forward, though: developers won’t use 64-bit integers on GPUs due to performance problems, so NVIDIA won’t improve 64-bit integer performance as it’s not being targeted by anyone. I imagine it’s only people like crypto researchers who use them currently, and they are a very small minority influence in the grand scheme of things.

Personally I would really like to push things so that we had stellar 64-bit and 32-bit integer performance, with intrinsics that encourage 64-bit-wide SWAR techniques etc.

The float16 improvements currently being pushed through by neural network applications are great and will be useful in the gaming sector, but it seems that integer performance will continue to be left by the wayside for a long time yet?

My hobby CUDA projects revolve around this topic, so you gave me an excuse to post:

https://sites.google.com/site/cudapermutations/

Yes, 64-bit integer math is slow on consumer GPUs, but both of the above problems do a great deal of 64-bit integer math and still have great performance. Those benchmarks are a bit old, and a single GTX 1080 is about 20% faster for those applications than the GTX Titan X.

Also, your CUDA-Z numbers seem a bit low:

http://imgur.com/lkN6by3

Are you on Windows or Linux?

Yes, 64-bit arithmetic is accomplished via instruction sequences generated by the compiler (on all current CUDA GPUs). There is no native 64-bit integer add or multiply instruction.
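To make the "instruction sequences" point concrete, here is a minimal host-side C++ sketch of a 64-bit add built from 32-bit adds plus a carry. It only illustrates the arithmetic; it is not the sequence the CUDA compiler actually emits, and the names `U64Pair`, `split`, `join`, `add64` and `check_add64` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// A 64-bit value held as two 32-bit halves, as a 32-bit datapath sees it.
// (Hypothetical type, for illustration only.)
struct U64Pair {
    std::uint32_t lo;
    std::uint32_t hi;
};

inline U64Pair split(std::uint64_t v) {
    return { static_cast<std::uint32_t>(v), static_cast<std::uint32_t>(v >> 32) };
}

inline std::uint64_t join(U64Pair p) {
    return (static_cast<std::uint64_t>(p.hi) << 32) | p.lo;
}

// 64-bit add from two 32-bit adds plus carry propagation.
inline U64Pair add64(U64Pair a, U64Pair b) {
    U64Pair r;
    r.lo = a.lo + b.lo;                             // low halves, wraps mod 2^32
    std::uint32_t carry = (r.lo < a.lo) ? 1u : 0u;  // a wrap means a carry out
    r.hi = a.hi + b.hi + carry;                     // high halves plus carry in
    return r;
}

// Usage: the emulated add agrees with the native one when a carry crosses bit 32.
inline void check_add64() {
    assert(join(add64(split(0xFFFFFFFFull), split(1ull))) == 0x100000000ull);
}
```

Hardware with an add-with-carry instruction can fold the carry test into the second add, which is why a 64-bit add can cost roughly two 32-bit operations rather than three.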

For 32-bit integer operations, in some cases (e.g. integer add) the relative throughputs should be discoverable/estimatable from the published throughput specifications:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

Transistors go where they are needed. It would be nice to have “stellar” 64-bit performance, but the opportunity cost must be weighed against other alternatives. Invariably, fast 64-bit integer never wins.

Modern graphics does not require 64-bit integer. Since graphics is the home for >90% of the GPU die produced, this will be the dominant factor. If we look at current compute application targets (e.g. HPC, scientific/technical computing, machine learning/deep learning), none of these require any significant 64-bit integer throughput. FWIW, modern graphics does not require 64-bit floating point either, so the vast majority of GPU die produced by NVIDIA have approximately zero double-precision floating point throughput.

If it were me, I wouldn’t hold my breath waiting for 64-bit native integer performance in GPUs.

In at least one case, I was able to significantly improve an integer-bound code by moving some indexing calculations to the floating point units, after doing an exhaustive proof that the indexing calculation, including intermediate results, would fully fit within the 23-bit mantissa. This reduced integer pressure enough to produce something like a 2-3x speedup. Unlike FP32 and FP64 arithmetic, which have separate units, all integer arithmetic (i.e. 32-bit and 64-bit multiply and add) goes through the same 32-bit integer units. Since these units may have 1/6 or less of the throughput of the floating point units, for an integer-bound code it can be a win to move arithmetic off the integer units.
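A minimal host-side C++ sketch of that idea follows, assuming a simple `row * width + col` index (the actual indexing calculation was not posted, so this formula and all names here are hypothetical). A float32 has 23 stored mantissa bits plus an implicit leading bit, so every integer up to 2^24 is exactly representable; the check below uses that bound, which is slightly looser than the 23-bit one mentioned above.

```cpp
#include <cassert>
#include <cstdint>

// Every integer in [0, 2^24] is exactly representable in float32.
constexpr std::uint64_t kExactLimit = 1ull << 24;

// row * width + col evaluated on the floating-point path.
// Exact only when the result (and therefore the product) stays within kExactLimit.
inline std::uint32_t index_via_float(std::uint32_t row, std::uint32_t width, std::uint32_t col) {
    float prod = static_cast<float>(row) * static_cast<float>(width);
    float idx = prod + static_cast<float>(col);
    return static_cast<std::uint32_t>(idx);
}

// The "proof obligation": true when the worst-case index cannot exceed the exact range.
inline bool fits_exactly(std::uint32_t max_row, std::uint32_t width, std::uint32_t max_col) {
    std::uint64_t worst = static_cast<std::uint64_t>(max_row) * width + max_col;
    return worst <= kExactLimit;
}

// Usage: a 4096 x 4096 grid has a largest index of 2^24 - 1, so the float path is exact.
inline void check_index_via_float() {
    assert(fits_exactly(4095, 4096, 4095));
    assert(index_via_float(4095, 4096, 4095) == 16777215u);
}
```

Once the worst case exceeds the limit (for example `row = 4096, width = 4096, col = 1`), the float result silently rounds to a neighbouring integer, which is why the range proof has to cover every intermediate value.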

I am curious as to what the use case is. If possible, could you please expand and clarify? Maybe alternative solutions are possible that don’t require 64-bit integer operations.

I take it that the integer operation performance issue is limited to 64-bit integer operations? Because as far as I can see, a GTX 1080 should easily beat even top-of-the-line CPUs for general 32-bit integer work by about a factor of five. Intel CPUs do offer many specialized instructions (for cryptography, for example) that likely give them an edge over GPUs where those are used.

As txbob already explained, the development of GPU products is market driven, and there simply does not seem to be sufficient market demand for faster 64-bit integer operations on GPUs. The GTX 1080 throughput for general 64-bit integer math seems to be roughly on par with current CPUs. Even so, it does not have to be a performance bottleneck. For example, from what I understand, the molecular simulation software AMBER uses 64-bit integer (fixed-point, really) in conjunction with single-precision floating-point computation, achieving significant speed-up versus CPUs.

Also worth considering is the fact that even though 64-bit integer math is emulated on GPUs, and GPU clock speeds are lower than those of CPUs, you can get a much larger number of concurrent computation threads consuming their subset of the problem space than is possible using a multi-core high-end CPU.

Mixed precision is a great approach, particularly when it comes to simulations. That approach tends to use 32-bit floats for 3D position and movement data, and 64-bit values for energy accumulation.

My GTX1080 is a Gigabyte G1 Gaming.

Interesting that it appears a little slow compared to that other figure. I wonder what is going on there - maybe I didn’t have it hitting its boost. It’s on a desktop with an i7 3770K, 32GB RAM, Windows 7 64-bit.

On the rest: yes, I totally get that, due to cost etc., transistors have to be carefully allocated where they will have the most current benefit. I guess it would be easier to push for getting int32 performance close to par with float32. When it comes to my gaming-related use cases I don’t care about double performance - and actually for gaming cards I would be quite happy if they stripped out support for doubles completely in exchange for improved integer performance - from my understanding the GTX1080 keeps bare minimum double support, but really it will never be used for gaming.

Trying to fudge integer calculations into floats just seems undesirable all round in terms of a software development path - but I am sure there are some gains to be had by doing so.

My use cases range from general game code to more specialised stuff I can’t disclose currently, but that relies a lot on ‘chess programming’/‘bit twiddling’ optimisations on the CPU for performance. So 64-bit integers are a would-be performance improvement and a convenience that can simplify things and help communicate with existing CPU code. The lack of integer support in vanilla AVX was initially a setback, for example.

At the high end of PC gaming we may be approaching a more common situation where we have an Intel AVX2 CPU + iGPU + high-end discrete GPU + older prior high-end discrete card as a viable platform (using DX12/Vulkan etc. to best utilise it all). So deciding what should be offloaded where is an interesting problem, although the main GPU in the short term will definitely still be saturated with graphics work.

When it comes to general game and simulation code I think progress has been fairly disappointing in terms of scale, and much more could be done there - and that’s definitely not all float work.

So it seems like a strategy that pushes int32 instead is the better option for now - and hope any resulting products are successful enough to give them a reason to start looking at improving int32 performance as a first step.

Point taken that even though the int32 performance doesn’t match float32, it still can be much faster than the CPU.

I do wonder though if we will see a resurgence of fixed point - for a lot of things in games 64-bit fixed point may actually be more desirable going forward. I have no idea whether integer units are still more transistor/cost effective than floating point units when it comes to hardware implementations, but I imagine so? In which case maybe that path could also help fight against the current roadblocks to Moore’s law.

Kepler family members run integer adds at 5/6 and integer multiplies at 1/6 of the FP32 rate. Maxwell has full-rate integer add, but backed off from multiply.

I’ve often thought integer multiply would be a useful benefit across the board (i.e. in all codes) for index calculations, but I guess not.

That is essentially already the case. The small residual amount (3%) would not move the needle enough elsewhere in terms of transistor budget, and it provides a consistent API (binary support) across the family, which has benefits.

[@txbob: I think you meant to write ABI, not API]. Making the feature set of each compute capability a strict superset of the feature set of lower compute capabilities (“onion layer” model) is crucial for any sane software development, so always offering at least minimal double precision (and going forward, half precision) support is a tax to be paid for that. As txbob states, the tax is quite small at present.

From the comments I assume your use case deals mostly with bitboard techniques that use 64 bits to represent chess boards, see for example http://chessprogramming.wikispaces.com/Bit-Twiddling. It should be noted that while GPUs do not offer the plethora of bit-twiddling instructions added to CPUs over the past decade, they do offer some important building blocks: count leading zeros, byte permutation, funnel shift, population count, bit reversal, LOP3 (implements any logical function with three inputs).
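As an illustration of the last building block, here is a bit-by-bit software model of a three-input lookup-table logic operation, written as host-side C++. The convention that the 8-bit table for a function is obtained by applying that function to the constants 0xF0, 0xCC and 0xAA follows the PTX documentation for `lop3`; the function and constant names below are made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Software model of a three-input LUT logic op:
// bit i of the result is lut[(a_i << 2) | (b_i << 1) | c_i].
inline std::uint32_t lop3_model(std::uint32_t a, std::uint32_t b, std::uint32_t c, std::uint8_t lut) {
    std::uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        std::uint32_t sel = (((a >> i) & 1u) << 2) | (((b >> i) & 1u) << 1) | ((c >> i) & 1u);
        r |= ((static_cast<std::uint32_t>(lut) >> sel) & 1u) << i;
    }
    return r;
}

// Truth-table columns for the three inputs.
constexpr std::uint8_t kA = 0xF0;
constexpr std::uint8_t kB = 0xCC;
constexpr std::uint8_t kC = 0xAA;

// Table for f(a, b, c) = (a & b) ^ c, built by applying f to the columns.
constexpr std::uint8_t kLutAndXor = static_cast<std::uint8_t>((kA & kB) ^ kC);

// Usage: one table lookup per bit replaces the two-operation expression.
inline void check_lop3_model() {
    std::uint32_t a = 0xDEADBEEFu, b = 0x12345678u, c = 0x0F0F0F0Fu;
    assert(lop3_model(a, b, c, kLutAndXor) == ((a & b) ^ c));
}
```

On the GPU the whole thing is a single instruction per 32-bit word, so a 64-bit bitboard expression with three operands would need two of them, one per half.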

Note that CUDA-Z bases its integer performance ratings on integer multiplies which to my knowledge do not feature heavily in bitboard algorithms. For simple logic and arithmetic the 64-bit integer performance of a GPU is typically about 1/3 of the 32-bit integer performance (emulation of each 64-bit operation requires two or three native 32-bit operations), so overall the performance of a GTX 1080 may still be ahead of even fast CPUs. One would have to look at specific algorithms for a more accurate assessment.
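To show where the extra operations come from for typical bitboard work, here is a host-side C++ sketch of a 64-bit AND and a 64-bit left shift expressed purely in 32-bit operations. The type and function names are hypothetical, and the sketch models the arithmetic only, not the instructions a GPU compiler would pick (a funnel shift can produce the high half of the shift in one step).

```cpp
#include <cassert>
#include <cstdint>

// A 64-bit bitboard stored as two 32-bit halves (hypothetical type).
struct Half64 {
    std::uint32_t lo;
    std::uint32_t hi;
};

inline Half64 from_u64(std::uint64_t v) {
    return { static_cast<std::uint32_t>(v), static_cast<std::uint32_t>(v >> 32) };
}

inline std::uint64_t to_u64(Half64 h) {
    return (static_cast<std::uint64_t>(h.hi) << 32) | h.lo;
}

// 64-bit AND: one 32-bit AND per half.
inline Half64 and64(Half64 a, Half64 b) {
    return { a.lo & b.lo, a.hi & b.hi };
}

// 64-bit left shift by n, restricted to 0 < n < 32 to keep the sketch short.
// The bits shifted out of the low half are merged into the high half.
inline Half64 shl64_small(Half64 v, unsigned n) {
    assert(n > 0 && n < 32);
    return { v.lo << n, (v.hi << n) | (v.lo >> (32 - n)) };
}

// Usage: shifting a board "up one rank" (by 8) moves bit 31 into the high half.
inline void check_half64() {
    assert(to_u64(shl64_small(from_u64(0x0000000080000001ull), 8)) == 0x0000008000000100ull);
}
```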

There is one case where I expect there could be an increase in popularity of integer instructions: fixed point arithmetic. In technical/scientific computing there is value in deterministic algorithms, especially with increasing parallelism and hardware complexity. A good example is the aforementioned AMBER simulation package and its deterministic reduction algorithms. Although they take it to almost religious extremes (“If It’s Not Reproducible, It’s Not Worth It!” is the title of Scott LeGrand’s talk :P), there is a point to some degree of determinism in complex parallel codes, or at least essential parts of them (e.g. reductions, load balancer implementations), and to accomplish that fixed point arithmetic is great.
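A small host-side C++ sketch of why fixed point helps with reproducible reductions: integer addition is associative, so the accumulated result does not depend on the order in which threads contribute, whereas a float32 sum can. The 2^32 scale factor and all names are assumptions made for this example, not AMBER's actual scheme.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical fixed-point scale: 32 fractional bits.
constexpr double kScale = 4294967296.0;  // 2^32

// Convert one float contribution to 64-bit fixed point (rounded to nearest).
inline std::int64_t to_fixed(float x) {
    return std::llround(static_cast<double>(x) * kScale);
}

// Order-independent accumulation: integer adds are associative.
inline std::int64_t sum_fixed(const std::vector<float>& v) {
    std::int64_t acc = 0;
    for (float x : v) acc += to_fixed(x);
    return acc;
}

// Plain float32 accumulation, whose result can depend on the order of the terms.
inline float sum_float(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}

// Usage: forward and reversed order give the same fixed-point sum.
inline void check_sum_fixed() {
    std::vector<float> v = {1.0e6f, 0.01f, -1.0e6f, 0.01f};
    std::vector<float> r(v.rbegin(), v.rend());
    assert(sum_fixed(v) == sum_fixed(r));
}
```

With the same input, `sum_float` will typically give different answers for the two orders, because the small terms are absorbed when they are added to the large ones. The price of the fixed-point version is a limited range: the scaled values and their running sum have to stay within 64 bits.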

The main reason we don’t use fixed point arithmetic for now is simply because i) 64-bit is useless ii) it gets crippled regularly and it may get even worse than it is.

Of course, as @cybernoid points out, this is somewhat of a chicken-and-egg problem, but at least on CPUs things are getting better, e.g. with AVX2 on Skylake you now get 2 vector instructions/cycle for most integer instructions (though AFAIK only with 32-bit operands). Also, AVX-512F supports 512-bit integer instructions, though I’m not sure what the throughput is (but I could imagine that it’s at least one if not two per cycle).

While we are on a similar topic - I wonder why NVIDIA can’t provide mul24 or mad24 instructions? It seems they could be implemented on existing fp32 hardware and would be more useful than the existing mad16 instruction.

Good point, I have been wondering about that too! It would be so much more efficient to offload integer ops to mul24/mad24 rather than tinker around with doing them in single precision floating point.

Hello,
Has anybody finished this? I would like to use this to create a prime miner for GPUs,
especially since many cryptocurrencies use prime calculations to mine coins.

Maxwell and Pascal have a 16-bit integer multiply unit, accessed via the machine instruction XMAD. 32-bit integer multiplies are emulated using XMAD on these architectures. You might find the following paper interesting:

N Emmart, J Luitjens, C Weems, C Woolley, “Optimizing Modular Multiplication for NVIDIA’s Maxwell GPUs”, 23rd Symposium on Computer Arithmetic, (2016), pp. 47-54
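For anyone wondering what emulating a 32-bit multiply with 16-bit multiplies looks like arithmetically, here is a host-side C++ sketch of the usual decomposition into three 16x16 partial products. It is not the actual XMAD sequence the compiler generates (I have not reproduced that here), and the function names are made up.

```cpp
#include <cassert>
#include <cstdint>

// Low 32 bits of a 32x32 multiply, built from 16x16 -> 32 partial products.
// a * b = a_lo*b_lo + (a_lo*b_hi + a_hi*b_lo) * 2^16 + a_hi*b_hi * 2^32,
// and the last term vanishes modulo 2^32.
inline std::uint32_t mul32_from_16(std::uint32_t a, std::uint32_t b) {
    std::uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    std::uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    std::uint32_t low = a_lo * b_lo;                 // fits in 32 bits exactly
    std::uint32_t cross = a_lo * b_hi + a_hi * b_lo; // may wrap; only its low 16 bits survive the shift
    return low + (cross << 16);
}

// Usage: the decomposition matches the native wrapping multiply.
inline void check_mul32_from_16() {
    assert(mul32_from_16(123456789u, 987654321u) == 123456789u * 987654321u);
}
```

Three multiply-type operations per 32-bit product is the reason integer multiply throughput on these parts sits well below the add throughput.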

If you are happy with just 24 bit integers, can’t you use floats in your code?
Overflow behaviour would obviously need some consideration.

Speaking about using the FP ALUs for emulating integer multiplications. Do you think it’s feasible to generate Instruction Level Parallelism between the DP and FP ALUs, provided that one uses registers that have no direct data dependence among each other?

Could we even achieve 3-way ILP between integer multiplication, FP and DP multiplication?

Christian

3-way ILP isn’t possible because the scheduler can only dual-issue. Dual issue of integer and FP isn’t possible since those operations use the same ALU functional units. But even dual issue of FP and DP is unlikely, simply because the operand collector can only read 3 registers per clock, so it couldn’t FEED a dual-issue even if it were possible. Though, in theory, if both FP and DP used SASS .reuse on two of their three arguments, there would be enough register bandwidth… but I would not expect support for this very limited and very rare opportunity.

That depends on the kind of ILP we are aiming to expose. If we are talking about overlapping integer, SP and DP operations within the latency of each other, then that is most likely possible.

Within the same cycle dual issue is the most that can be achieved, as a maximum of two instructions is issued per cycle per warp.

If we are talking about overlapping SP and DP FMA in the same cycle to achieve FLOPS values beyond those specified by Nvidia, I have recently been contemplating that, but have come to the conclusion that it’s unlikely to be possible because the real limit is on register file bandwidth, and it’s unlikely Nvidia would have gone to the length of building in the additional data paths necessary and then forgotten about them when claiming peak FP performance.

If you are only interested in multiplication, no addition (thus immediately giving up half of the claimed peak FLOPS value), then dual-issuing a SP mul and an integer mul, or even a DP mul and an integer mul, appears possible. That would take you to 192+32=224 multiplications per SM per cycle on Kepler. Maxwell and Pascal need multiple instructions for one integer multiplication, so won’t gain as much over just using SP mul, but it should still be possible to dual-issue those instructions with a SP mul.

Whether there is enough register bandwidth to dual issue SP and DP multiplication on Kepler to get to 192+64=256 multiplications per SM per cycle would need experimental exploration. I might do that experiment if I find some time for it. I haven’t done it so far as I am almost exclusively interested in balanced mul/add throughput, where FMA gives you more bang for the buck.

Hi guys, any progress? I’m giving out a bounty for anybody who will assist me.

Progress with what?

Tera, could I contact you directly regarding a project to create a prime calculator with NVIDIA GPUs?