Hello!

Question 1:

On a GTX 1080 Ti, if I hammer it nonstop with 32x32-bit unsigned integer multiply-adds, is the following correct: at any given clock, a single SM can execute exactly 32 of them, each producing either the low bits or the high bits. So across all 128 CUDA cores, which represent 4 warps, throughput-wise (not latency-wise) it takes 8 clocks to get the full 64-bit output, both the 32 high bits and the 32 low bits. Is this assumption correct?

Question 2:

The L1 data cache of the GTX 1080 Ti: suppose I run 9 warps of 32 CUDA cores each on one SM, where 1 warp uses 48 KB of L1 data cache and the other 8 warps use 6 KB each, all running at the same time. That is a total of 96 KB of L1 data cache. Is it possible to allocate that at the same time?

Does the L1 size allow this?

As I read here: http://docs.nvidia.com/cuda/pascal-tuning-guide/index.html

“GP104 provides 96 KB per SM.”

and over here:

https://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units

That page says the GTX 1080 has 96 KB of L1, but what does the GTX 1080 Ti have, which is listed there as GP102?

That card is so expensive that I'm not going to buy it until I know 100% for sure it can do what I have on paper here. Otherwise it's going to be a Titan Z.

This is about getting a rough understanding of how much logic the chip has for integer multiplication, and of the L1 data cache. A chip is only as good as its caches. If signed integer multiplication is faster, even by 1 nanosecond, please mention that. Every picosecond counts. Not latency-wise, but throughput-wise.

The chip can execute 3584 cores * 1.48 GHz (if clocked at that) = 5304.32 G core-clocks per second.

I'm interested in knowing how many of those can carry the high bits or low bits of 32x32-bit unsigned integer multiplies. Right now I assume that's 25% of the total clocks at any given time. Is that correct?

Many thanks in advance for relevant information.

Background:

In the summer of 2016 I wrote a cool program to sieve prime numbers. It doesn't do the sieving itself, but cracks the discrete logarithm with a cool algorithm (this one is O(sqrt n) using the baby-step algorithm). On some old GPU (Fermi) it is already 40x faster than a very fast assembly program that the math guys use right now to sieve, written by a cool chap in the previous century; since then that optimal assembler for AMD CPUs has hardly been improved (as that wasn't possible: provably the maximum was achieved). The 40x is versus a single core, and I mainly have Intel CPUs here now (not so highly clocked Xeons).

Now I'm looking towards newer GPUs. When I tested the siever, which will become open source (so no secrets there), I was a bit worried that it only gets a factor 2.5x faster on a GTX 980 compared to Fermi. Kepler also scales better than the GTX 980 there; I guess that has to do with Kepler's faster caches. Overall, of course, the higher SM count of the 980 wins.

I'm used to writing software that scales nearly 100% on supercomputers and PC hardware, not to mention gamer cards, so I will need some better testing at home to get it scaling better.

Now I'm interested in a GTX 1080 Ti to get it going.

The question above is for a different program: an FFT using integers, which we call an NTT (number theoretic transform). I wrote those before for CPUs. Paper scribbling so far indicates it might be possible to squeeze about 1 Tflop double-precision equivalent out of the GTX 1080, if it had double precision enabled like a Tesla. Though there could be disappointments with respect to the caches, which already occurred on the GTX 980, where I am missing cache performance. Note that this is not about occupancy: there are also many simple integer instructions that are not counted in that 1 Tflop. The 1 Tflop is the inner loop. The GPU has way more overhead than a CPU, thanks to the CPU's larger L1, L2 and L3 caches that can seek and prefetch data from random spots. The total amount of Gflops is totally uninteresting to me; anyone can write useless instructions, except of course they wouldn't be writing prime number software in that case. I care about the throughput time of the transform.

The Woltman assembler library for AVX, which all the math guys use, is a brilliant piece of very deeply optimized code. Comparing an i7 at 4 GHz with 4 cores at its best settings (at 4 cores it is faster than 4 individual cores running embarrassingly parallel) versus the GTX 1080 Ti, I calculated that with the first incarnation of the NTT I have on paper now, the GTX 1080 Ti could be roughly 3.34x faster. That is the speedup we calculated.

All timings here are throughput timings. The actual time from start to finish is a lot longer, as many transforms and different kernels run in parallel. The same is true for the i7.

With big prime numbers we care about throughput not about latency.

Note we are comparing a much more efficient DWT on the i7 here with a generic NTT on the GPU, without further optimizations to the limbs on the GPU yet (a more efficient manner of calculating each "butterfly" type operation using modular integer arithmetic). The DWT is a far faster and better form of carrying out an FFT for large prime numbers in the millions of bits, such as Mersenne, Riesel and Proth numbers and so on.

All this is theoretical calculation so far with respect to the NTT. No code has been written for the GPU on my side yet to complete this task, and no rights or claims can be derived from anything I wrote so far.

undersigned,

Vincent Diepeveen