Technical questions on GTX 1080 Ti multiplication

Hello!

Question 1:
On the GTX 1080 Ti, if I hammer nonstop 32x32-bit unsigned integer multiply-adds, is the following correct: at any given clock a single SM can execute exactly 32 of them, each producing either the low bits or the high bits. So across all 128 CUDA cores, which represent 4 warps, it takes 8 clocks (throughput-wise, not latency-wise) to get the full 64-bit output, i.e. both the 32 high bits and the 32 low bits. Is that a correct assumption?
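To be concrete, per element I mean producing both halves of the product, something like this (just a sketch of the operation in CUDA terms, not my actual kernel):

    // Sketch of the per-element operation I mean (illustration only)
    __device__ unsigned long long mul32x32_to_64(unsigned int a, unsigned int b)
    {
        unsigned int lo = a * b;            // low 32 bits of the 32x32 product
        unsigned int hi = __umulhi(a, b);   // high 32 bits of the 32x32 product
        return ((unsigned long long)hi << 32) | lo;
    }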

Question 2:
About the L1 data cache of the GTX 1080 Ti: suppose I run 9 warps of 32 CUDA cores on an SM, where 1 warp eats 48 KB of L1 data cache and the other 8 warps each eat 6 KB. They run at the same time, for a total of 96 KB of L1 data cache. Is it possible to allocate all of that simultaneously?

Does the L1 size allow this?
As i read here: http://docs.nvidia.com/cuda/pascal-tuning-guide/index.html
“GP104 provides 96 KB per SM.”

and over here:
https://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units

That page says the GTX 1080 has 96 KB L1, but what does the GTX 1080 Ti have, which is listed there as GP102?

That card is so expensive that I'm not going to buy it until I know for 100% sure it can do what I have on paper here. Otherwise it's going to be a Titan Z.

This is about getting a rough understanding of how much logic the chip has for integer multiplication, and of its L1 data cache. A chip is only as good as its caches. If signed integer multiplication is faster, even by 1 nanosecond, please mention that. Every picosecond counts, throughput-wise, not latency-wise.

The chip can execute 3584 cores * 1.48 GHz (if clocked at that) = 5304.32G core-cycles per second.
I'm interested in how many of those cycles can produce the high bits or low bits of 32x32-bit unsigned integer math. Right now I assume that's 25% of the total cycles at any given time. Is that correct?

Many thanks in advance for relevant information.

background:
Summer 2016 I wrote a cool program to sieve prime numbers. It doesn't do the sieving itself, but it does the cracking of the discrete logarithm with a cool algorithm in a polynomial manner (as we know, any sort of crypto you can prove can be cracked polynomially; this one is O(sqrt n) with the baby-step algorithm). On some old GPU (Fermi) it is already 40x faster than the very fast assembly program that the math guys use right now to sieve, written by a cool chap in the previous century; since then that optimal assembler for AMD CPUs has hardly been improved (as that wasn't possible, the provable maximum had been achieved). So it is 40x faster than a single core here, and I mainly have Intel CPUs now (not so highly clocked Xeons).

Now i’m looking towards newer gpu’s. When i tested the siever, which will get into open source, so no secrets there, i was bit worried it is only getting factor 2.5x faster at GTX980 compared to fermi. Kepler also scales better than gtx980 there - yet i guess has to do with faster caches of Kepler. Overall of course the higher SM count of the 980 wins there.

I’m used to write software that scales nearly 100% at supercomputers and PC hardware, not to mention gamerscards so will need some better testing at home to scale better.

I'm now interested in a GTX 1080 Ti to get it going.

The above question is for a different program, an FFT using integers, which we call an NTT (number theoretic transform). I wrote those before for CPUs. Paper scribbling so far indicates it might be possible to squeeze about 1 Tflop double-precision equivalent out of the GTX 1080 Ti, if it had double precision enabled like a Tesla. Though there could be disappointments with respect to the caches, which already occurred on the GTX 980, where I am missing cache performance. Note that is not occupancy: many simple integer instructions are there as well that are not counted in that 1 Tflop. The 1 Tflop is the inner loop. The GPU has far more overhead than a CPU, thanks to the CPU's larger L1, L2 and L3 caches that can seek and prefetch data from random spots. The total amount of Gflops is totally uninteresting to me; useless instructions can be written by anyone who doesn't write prime number software. I care about the throughput time of the transform.

The Woltman assembler library for AVX, which all the math guys use, is a brilliant piece of very deeply optimized code. If we compare a 4 GHz i7 with 4 cores at its best settings (running on 4 cores it is faster than 4 individual cores running embarrassingly parallel) against the GTX 1080 Ti, I calculated that with the first incarnation of the NTT I have on paper now, the GTX 1080 Ti could be roughly 3.34x faster. That is the speedup we calculated.

All timings here are throughput timings. The actual time from start to finish is a lot longer, as many transforms and different kernels run in parallel. The same is true for the i7.

With big prime numbers we care about throughput not about latency.

Note we compare a much more efficient DWT on the i7 here with a generic NTT on the GPU, without further optimizations to the limbs on the GPU yet (a more efficient manner of calculating each “butterfly” type operation using modular integer arithmetic). The DWT is a far faster and better way of carrying out an FFT for large prime numbers in the millions of bits, such as Mersenne, Riesel, Proth and so on.

All of this is theoretical calculation so far with respect to the NTT. No GPU code has been written yet on my side to complete this task, and no rights or claims can be derived from anything I wrote so far.

undersigned,
Vincent Diepeveen

On Pascal and Maxwell 32-bit integer multiplies are emulated using the XMAD instruction. This also means that the number of instructions needed to extract the low 32 bits of the product is lower than the number of instructions needed to extract the high 32 bits.

You need three XMADs to generate the low 32 bits. I do not recall how many instructions are needed to generate the high 32 bits of the full product but it must be at least four, since XMAD provides a 16-bit multiply with 32-bit add. I would suggest disassembling relevant code with cuobjdump --dump-sass to find out.
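For example, something along these lines lets you inspect the emulation sequence (a minimal sketch; the file name and architecture flag are arbitrary):

    // mul32.cu -- minimal kernel whose SASS shows how 32-bit multiplies are emulated
    __global__ void mul32(unsigned int *out, unsigned int a, unsigned int b)
    {
        out[0] = a * b;             // low 32 bits of the product
        out[1] = __umulhi(a, b);    // high 32 bits of the product
    }
    // compile for the target architecture and disassemble, e.g.:
    //   nvcc -arch=sm_61 -cubin mul32.cu -o mul32.cubin
    //   cuobjdump --dump-sass mul32.cubin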

I believe (but have not checked) that XMAD is a full throughput instruction that is executed at the same rate as FFMAs. If that is correct, to find the theoretical throughput divide the advertised SP FLOPS rate by six to get the throughput of 32x32->32 bit multiplies. GeForce 1080Ti is listed with 10609e9 SP FLOPS at base clocks, so that would give us a throughput of 1768e9 32x32->32 bit multiplies. What do you actually measure?

I assume you are aware of mfaktc and other existing programs for GPU-accelerated prime number search.

Hello, thanks for your quick reply. Let's be careful here, as I don't follow your logic that quickly.

Hah, in its first year my chess program was called "diep slow intelligence" (second-year uni student, you know).
Later, when it got better, it was called Diep :)

We have 128 CUDA cores in each SM.
How many XMAD execution units are there in total on each SM?
If that's 128 then your logic might hold.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-instruction-throughput

I do indeed see 128 mentioned there for compute capability 6.1 and 6.2 for 32-bit float multiply.

32 XMADs for the low bits, that's very freaking good news if I may say so, as the GDDR bandwidth of the GTX 1080 Ti isn't fully used yet. Yet you suddenly talk about 16 bits now, and I see different numbers quoted there: 2 for 6.1 and 256 for 6.2.

Pascal is listed as compute capability 6.1 here: https://en.wikipedia.org/wiki/CUDA

And the instruction throughput table says only 2 execution units on each SM can do 16x16 bits?

I’m confused?

Note that mfaktc is not a siever; it does trial factoring for Mersenne and Wagstaff numbers (which I searched for a while as well, as you can see on the list of finds) and has been written by Oliver Weihe.

My siever is a generic one for all Riesels and Proths without limitation on k (right now k is 32 bits; I could also make that 64 bits). As it is pretty fast on a GPU, I'm still searching for a CPU part that generates enough small prime numbers per second to keep it fed.

Mersenne numbers have k=1 in k * 2^n - 1 (the Riesel form); the generic formula is k * 2^n ± 1.
Wagstaff numbers are (2^n + 1) / 3.

Sieving k * 2^n ± c with k not equal to 1, or with k equal to 1 and c larger than 1, requires cracking a discrete logarithm. That cracking of the discrete logarithm happens on the GPU; it obviously is the slow part of my siever.
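For reference, the textbook O(sqrt n) baby-step giant-step idea looks like this (a plain host-side C++ sketch for a prime modulus, purely illustrative; this is not my GPU siever code):

    #include <cstdint>
    #include <cmath>
    #include <unordered_map>

    // 64x64->128-bit modular multiply to avoid overflow (GCC/Clang extension)
    static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p)
    {
        return (uint64_t)((__uint128_t)a * b % p);
    }

    static uint64_t powmod(uint64_t b, uint64_t e, uint64_t p)
    {
        uint64_t r = 1;
        for (b %= p; e; e >>= 1, b = mulmod(b, b, p))
            if (e & 1) r = mulmod(r, b, p);
        return r;
    }

    // Solve g^x = h (mod p) for prime p; returns x, or -1 if no solution was found.
    int64_t bsgs(uint64_t g, uint64_t h, uint64_t p)
    {
        uint64_t m = (uint64_t)ceil(sqrt((double)(p - 1))) + 1;  // +1 guards rounding
        std::unordered_map<uint64_t, uint64_t> baby;             // maps g^j -> j
        uint64_t cur = 1;
        for (uint64_t j = 0; j < m; j++) {                       // baby steps g^0..g^(m-1)
            baby.emplace(cur, j);
            cur = mulmod(cur, g, p);
        }
        uint64_t ginv_m = powmod(g, p - 1 - (m % (p - 1)), p);   // g^(-m) mod p (Fermat)
        uint64_t gamma = h % p;
        for (uint64_t i = 0; i < m; i++) {                       // giant steps h * g^(-i*m)
            auto it = baby.find(gamma);
            if (it != baby.end())
                return (int64_t)(i * m + it->second);            // x = i*m + j
            gamma = mulmod(gamma, ginv_m, p);
        }
        return -1;
    }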

Edit: I don't get to your number.

The chip can execute 5.3T multiply-adds per second. They just count each multiply-add as 2 flops, which is how they get to 10.6 Tflops.

5.3T / 7 = 0.76T results of 2x32 bits per second (both the high and the low bits of a 32x32=64-bit unsigned integer multiplication) at the default GTX 1080 Ti clock, if it requires 7 instruction slots in total, throughput-wise.
That still is very good news if it is true, as I count 8 right now, but I'm not sure about all this.

I don’t think that NVIDIA provides detailed information, but it seems plausible to assume that the units processing XMAD are the exact same units processing FFMA. Which in turn would mean that FFMA and XMAD have the same instruction throughput. The instruction throughput of FFMA is half the SP FLOP rate, because each FFMA represents two floating-point operations.

Seen another way, the GTX 1080Ti has 3584 “CUDA cores”, each issuing one FFMA per cycle, which at 1480 MHz base clock gives us a throughput of 5304e9 FFMAs/second, or 10608e9 SP FLOPS. So if we assume that XMAD and FFMA have the same instruction throughput (and a quick benchmark seems to indicate that this is the case for sm_50), and knowing that one 32x32->32 bit multiply requires three XMADs, that would give us a throughput of 1768e9 32x32->32 integer multiplies per second. Which is one sixth of the SP FLOP rate as I stated above.
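The kind of micro-benchmark I mean looks roughly like this (an illustrative sketch only; the names, loop count, and launch configuration are arbitrary, not exactly what I ran):

    // Times a long run of independent 32x32->32 bit multiplies and reports the rate.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define ITER 65536

    __global__ void mul_lo_bench(unsigned int *out, unsigned int seed)
    {
        unsigned int a = seed + threadIdx.x;
        unsigned int b = seed ^ (blockIdx.x + 1);
        unsigned int c = a + 1, d = b + 3;
        #pragma unroll 16
        for (int i = 0; i < ITER; i++) {
            a = a * b;    // one 32x32->32 multiply (low bits)
            c = c * d;    // second, independent chain to keep the pipes busy
        }
        out[blockIdx.x * blockDim.x + threadIdx.x] = a + c;   // keep results live
    }

    int main(void)
    {
        const int blocks = 2 * 28, threads = 1024;   // e.g. 2 blocks/SM on the 28-SM 1080 Ti
        unsigned int *d_out;
        cudaMalloc(&d_out, blocks * threads * sizeof(unsigned int));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        mul_lo_bench<<<blocks, threads>>>(d_out, 12345);     // warm-up run
        cudaEventRecord(start);
        mul_lo_bench<<<blocks, threads>>>(d_out, 12345);     // timed run
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double muls = 2.0 * ITER * (double)blocks * threads;
        printf("%.1fe9 32x32->32 bit multiplies per second\n", muls / (ms * 1e-3) / 1e9);

        cudaFree(d_out);
        return 0;
    }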

My assumption is that Pascal sm_61 behaves like sm_50 when it comes to integer multiplies, but not having access to such hardware I have no way of verifying that assumption.

As I said, getting the high 32 bits, or the full 64-bit product, will be more expensive (i.e. have lower throughput), and it is unlikely that any real-life code will get quite to the theoretical throughput due to secondary effects (register bank conflicts, decoder throughput, instruction cache, etc., etc.).

In any event, I think it is safe to conclude that a high-end GPU like the GTX 1080Ti will run rings around even a high-end CPU in terms of integer arithmetic throughput, contrary to the FUD some of Intel’s mouthpieces are trying to spread (I still read stuff like “GPUs are not designed for integer work”).

I really need the full 64 bits of every 32x32-bit multiplication, so I'm interested in the whole package.

So you estimate the GPU can deliver, across its 3584 CUDA cores, a throughput of one full 64-bit result per 7 clocks per core. Do I say this correctly, or do you expect I take a hit somewhere for the high bits?

So every 7 clock ticks there are 3584 32x32 multiplications delivering 64 bits each (throughput-wise). Note it is an overpriced gamer GPU, not high-end. High-end has a hard ECC requirement… whereas for prime numbers, at home we are more than happy with the checksum that GDDR5 generates, which is more than enough :) The logic for ECC on a GPU is really a lot. Way too much for a gamer GPU :)

Simple enough to find out. Set up a little benchmark that uses inline PTX to issue many mul.wide.u32 instructions and measure the performance. You can also look at the SASS with cuobjdump --dump-sass to check how efficiently that instruction is emulated.
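A kernel body for such a benchmark could look roughly like this (a sketch only; the loop count is arbitrary, and you would time it the same way as any other micro-benchmark):

    // Sketch: issue many mul.wide.u32 instructions via inline PTX.
    __global__ void mulwide_bench(unsigned long long *out, unsigned int seed)
    {
        unsigned int a = seed + threadIdx.x;
        unsigned int b = seed ^ (blockIdx.x + 1);
        unsigned long long r = 0;
        #pragma unroll 16
        for (int i = 0; i < 65536; i++) {
            unsigned long long t;
            asm ("mul.wide.u32 %0, %1, %2;" : "=l"(t) : "r"(a), "r"(b));
            r += t;                    // keep the result live
            a += 1;                    // vary one operand so nothing folds away
        }
        out[blockIdx.x * blockDim.x + threadIdx.x] = r;
    }
    // cuobjdump --dump-sass on the compiled binary shows the XMAD sequence
    // that mul.wide.u32 is emulated with on Maxwell/Pascal.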

Note that if you need to construct wider multiplies, use XMAD as a building block, not mul.wide.u32. I posted some example code a while ago on how to do that, will try to find it. [Later:] See here:

https://devtalk.nvidia.com/default/topic/1017754/long-integer-multiplication-mul-wide-u64-and-mul-wide-u128/

Right, benchmarking is something I need to do extensively for this NTT anyway, as it comes down to figuring out how I can get good bandwidth out of the GDDR5 while some lookups need to be semi-random. So it's a combination of getting the L2 to work well and streaming a long enough run of bytes at addresses not too far away from the previous GDDR5 read on that channel.

All that benchmarking is going to eat loads of time. That's also what delayed the coding of the siever so much: all the endless tests I did to get maximum speed out of the L1 cache.

In the end the algorithm I use needs less L1 cache than the original design on paper. Streaming certain things directly to and from GDDR5 was simply faster than going via the L1 cache, as that latency can be hidden. Well, I should say that with caution: it is actually slower on Fermi, as Fermi doesn't seem to be able to hide the GDDR5 latencies very well.

I found it interesting that running more than 8 warps of 32 CUDA cores on each SM didn't really speed things up much with the framework used right now.

For the NTT code, which requires all those multiplications, there is a hard limit on the number of warps I can run on each SM. That'll be 8 or 9 warps, as they eat the entire L1 data cache.

That is still an open question: how much L1 data cache does GP102 have available? :)

The delay here will be the conversation with the bank account manager, who is going to tell me that getting a mortgage on my bicycle and wooden shoes to pay for that GP102 GTX 1080 Ti is going to be complicated :)

Are you making maximum use of the register file? If not, you would want to try harder since this is the fastest memory available on the GPU. Have you looked into the explicit use of shared memory as an alternative to relying on the L1 cache?
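As a generic illustration of what I mean by explicit shared memory use (sizes and names are arbitrary, not tailored to your siever):

    // Generic sketch: stage a tile in shared memory instead of relying on L1.
    // Launch with 256 threads per block.
    __global__ void stage_tile(const unsigned int *in, unsigned int *out, int n)
    {
        __shared__ unsigned int tile[256];                // 1 KB per 256-thread block
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid < n) tile[threadIdx.x] = in[gid];         // coalesced load into shared
        __syncthreads();
        // ... work on tile[] here; data reused within the block stays on-chip
        // and under programmer control instead of competing for L1 ...
        if (gid < n) out[gid] = tile[threadIdx.x];
    }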

Eight or nine warps per SM (256 or 288 threads per SM) seems too low to achieve good performance. I think it is entirely possible you would see better performance by using on the order of 1K threads, even if that means thrashing the L1.

My recommendation: Do some experiments and let the CUDA profiler guide you to the bottlenecks. I see too many instances of programmers over-analyzing on paper in these forums (well, for my tastes). To get a good feel for any sort of machine, one has to use it, and often.

Yes sir, you must be very careful with the register file. NVIDIA's compiler is not so clever. You have to reuse the same variables, for example, otherwise you lose another register and then your program doesn't even run with enough warps. You lose registers of the register file very quickly. They are very valuable.

That does not match my experience. For at least the past four years the code quality produced by the CUDA compiler has been very good. One thing you have to consider is that any compiler uses a huge collection of heuristics, and those heuristics rarely provide an optimal result for any specific case. But they also rarely totally screw up the code and in general provide good results for the vast majority of source code thrown at them.

When skilled programmers “beat” a compiler it often boils down to the programmer using information that is not available to the compiler (either because the programming language or tool chain doesn’t allow to specify that information, or because there is a way to “tell” the compiler but the programmer did not use it).

What I meant by maximum use of the register file is this: If you multiply the number of concurrently-running threads by the number of registers used per thread, is the product close to the total number of registers available?
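As a back-of-the-envelope illustration (my own example numbers): a Pascal SM has a 64K-entry register file (65536 32-bit registers) and can hold at most 2048 resident threads, so for instance:

    32 registers/thread * 2048 threads/SM = 65536 registers  -> register file fully used
    64 registers/thread * 1024 threads/SM = 65536 registers  -> also fully used
    (per-thread register usage is reported when compiling with nvcc -Xptxas -v)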

Yeah well, YMMV when we talk about the definition of a good compiler. If I were leading that compiler team, the first thing I would do is kick out all languages that are object oriented, not to mention byte code or scripting languages of course. Imperative programming is always better if you want speed :)

Remember that the generic siever runs on different GPUs. The NTT, on the other hand, I will write specifically for each architecture, as the plan stands right now.

I’d say: go give it a try and see what you can accomplish. That assumes of course that you are already a compiler engineer today. I am not a compiler engineer, but I worked (at times closely) with compiler engineers for about 15 years. My take-away: compilers are difficult.

As far as CUDA is concerned, I would boldly claim I have analyzed, in detail, more compiler-generated code than any one person on the CUDA compiler team. So I think I am somewhat qualified to judge the quality of the generated code.

There is always room for improvement in compilers: over long periods of time better compilers produce about 5% performance improvement per year on average. There is also a trade-off between compile times and the speed of the generated code. As host platforms become faster, compilers can spend more effort on optimizations.

I am looking forward to your GPU-accelerated prime-number search application and expect nothing but amazing performance.

I’m busy releasing a 3d printer so tad busy to be involved in those guys flown over from Russia who are always in those compiler teams :)

For me prime numbers are a big hobby. My computers over here put 24/24 time into that hobby. GPUs can be amazing things, yet not much good publicly usable code is out there for prime numbers.

The sieve is working fine here, yet it lacks all sorts of code to make it nice and easy for everyone to work with, and I lack a good sieve on the CPU to feed it enough primes per second; that's how FAST those GPUs are.

Didn’t decide yet which gpu to buy first though. A titan Z or GTX1080ti. That move from 8 clocks to 7 clocks for getting the full 64 bits of a 32x32 multiplication, if it is true, sure gets the GTX1080ti a tad higher on the list of cards to buy to get it to work on it!

On the 980 series, not to mention Kepler, an NTT is too slow for sure. Only double precision is doing great there. On the Titan Z the problem is not the number of double precision multiplications that can be executed, as it has enough Gflops there, factors more than needed. The hard work is fitting it within the 336.5 GB/s (say 300 GB/s of practical user data bandwidth) that you have on each of its GPUs. The GDDR5 bandwidth dominates everything on that GPU.

For the NTT on the GTX 1080 Ti all sorts of optimizations are possible, which makes it a very interesting project. In doubles you can effectively store fewer bits of the transform than in an integer transform.

The occupancy in a double is 16-18 bits per double, so that's roughly 25% occupancy.

In an all-integer transform using P = 2^127 - 1 it's a whopping 43% occupancy, so the bandwidth from/to the GDDR5 RAM gets used a lot more effectively to get the same thing done.

The difficult thing to get working fast, in both cases, is the same L2 cache-fiddling algorithm that has to be invented. In that sense it doesn't matter which GPU you go for. Yet my budget is not unlimited :)

Let’s sleep over it :)

Any thoughts on how much L1 data cache the GP102 (so the GTX 1080 Ti) has available on each SM?
Also 96 KB of data cache like GP104, or do we still lose instruction cache from that?

Kepler-family GPUs are actually quite fast for integer multiplies, as they can do a 32-bit integer multiply-add in a single IMAD instruction at one quarter of the native integer ALU throughput. Makes for nice dense code, too. What's not there is the energy efficiency of Maxwell and Pascal.

I haven’t used a Pascal-family GPU ever, so by extension I haven’t run any microbenchmarks to tease out what the L1 size may be. If you can’t find the number in NVIDIA’s docs somewhere, that’s what you’ll have to do to get that piece of data. Honestly, I don’t really care knowing the exact cache metrics are (size, associativity, line length, exclusive/inclusive, replacement policy) as in my experience that doesn’t really help me write faster GPU code.

Over here it is critical to know. You need to do a bunch of semi-random accesses, so you want the L2 to profit optimally from that so it doesn't waste bandwidth. You can also choose whether you write to semi-random spots. So it is also critical to know all sorts of things regarding the GDDR5, like how many cache lines you can write to at the same time from each SM and from all SMs together before the entire cache line gets flushed to GDDR5.

All those details are totally crucial for getting good performance out of the FFT/NTT, regardless of whether it's based upon double precision or upon integers of 32-64 bits.

The important thing is always knowing what to not do :)

A lot of benchmarking is also part of the case there. An important feature of DDR3, for example, is that after 32 bytes it can abort reading and skip the rest of the cache line; that's one of the major reasons why DDR3 is in practice a lot faster than DDR2. For the same reason it is important to realize what the strong points of GDDR5 are :)