Determining latency and throughput for modular multiplication on CPU and CUDA

I need to determine both latency and throughput for (unsigned) modular multiplication in CUDA and on the CPU (an Intel i5 750).

For the CPU I found this document (p. 121, the Sandy Bridge tables; I am not really sure which table I should refer to). For "MUL/IMUL r32" I get a latency of 4 cycles and a reciprocal throughput of 2. A "DIV r64" has a latency of 30-94 cycles and a reciprocal throughput of 22-76.

Worst-case scenario:

latency: 94 + 4 = 98 cycles

rec. throughput: 76 + 2 = 78 cycles

Right? Although I am using OpenSSL to perform these operations, I am fairly sure that at the lowest level it always comes down to simple modular multiplications.

Regarding CUDA, I am currently performing modular multiplications in PTX: multiplying two 32-bit numbers, storing the result in a 64-bit register, loading the 32-bit modulus into a 64-bit register, and then performing a 64-bit modulo, roughly as in the sketch below.
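A minimal sketch of what I mean (the wrapper function and its name are just for illustration; mul.wide.u32 is the PTX 32x32->64 multiply, and the 64-bit % compiles to a subroutine call):

    __device__ unsigned int modmul32(unsigned int a, unsigned int b, unsigned int m)
    {
        unsigned long long p;
        // 32x32 -> 64-bit multiply in PTX
        asm("mul.wide.u32 %0, %1, %2;" : "=l"(p) : "r"(a), "r"(b));
        // 64-bit modulo against the 32-bit modulus widened to 64 bits
        return (unsigned int)(p % (unsigned long long)m);
    }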

If you look here (p. 76), they say the throughput of 32-bit integer multiplication on Fermi (compute capability 2.x) is 16 operations per clock cycle per multiprocessor. Regarding modulo, they just say: "below 20 instructions on devices of compute capability 2.x"…

What does that mean exactly? A worst-case latency of 20 cycles per modulo per MP? And what about throughput? How many modulo operations per MP per clock cycle?

Latency is fairly meaningless with a throughput architecture like the GPU. The easiest way to determine throughput numbers for whatever operation you are interested in is to measure it on the device you plan to target. As far as I know, this is how the tables are generated in the CPU document you referenced.

You can disassemble the machine code (SASS) for the modulo operation using cuobjdump --dump-sass. When I do this for sm_20, I count a total of sixteen instructions for a 32/32->32-bit unsigned modulo. From the instruction mix, I would estimate the throughput at around 20 billion operations per second on a Tesla C2050, across the entire GPU (note that this is a guesstimate, not a measured number!).
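For example (file name and kernel are illustrative):

    // mod32.cu -- minimal kernel to inspect the SASS for a 32-bit unsigned modulo
    __global__ void mod32(unsigned int *r, const unsigned int *a, const unsigned int *b)
    {
        *r = *a % *b;
    }

    // compile for sm_20 and disassemble:
    //   nvcc -arch=sm_20 -cubin -o mod32.cubin mod32.cu
    //   cuobjdump --dump-sass mod32.cubin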

As for the 64/64->64-bit unsigned modulo, which is implemented as a called subroutine, I recently measured a throughput of 6.4 billion operations per second on a C2050 using CUDA 5.0.
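A microbenchmark along these lines can produce such a measurement (a hedged sketch; the launch configuration, constants, and names are illustrative, not my exact setup):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread runs a dependent chain of 64-bit unsigned modulo operations.
    __global__ void mod64_bench(unsigned long long *out, unsigned long long m, int iters)
    {
        unsigned long long x = blockIdx.x * blockDim.x + threadIdx.x + 1ULL;
        for (int i = 0; i < iters; ++i)
            x = (x + 0x9E3779B97F4A7C15ULL) % m;         // chain cannot be optimized away
        out[blockIdx.x * blockDim.x + threadIdx.x] = x;  // keep the result live
    }

    int main()
    {
        const int blocks = 1024, threads = 256, iters = 1000;
        unsigned long long *d_out;
        cudaMalloc(&d_out, blocks * threads * sizeof(unsigned long long));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        mod64_bench<<<blocks, threads>>>(d_out, 1000000007ULL, iters);  // warm-up
        cudaEventRecord(start);
        mod64_bench<<<blocks, threads>>>(d_out, 1000000007ULL, iters);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double ops = (double)blocks * threads * iters;
        printf("throughput: %.2f Gops/s\n", ops / (ms * 1e-3) / 1e9);

        cudaFree(d_out);
        return 0;
    }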

You might want to look into Montgomery or Barrett reduction for modular multiplication, instead of using division.
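For reference, a 32-bit Montgomery multiplication can be sketched like this (not a drop-in replacement: operands must be converted to Montgomery form first, the modulus must be odd and here below 2^31, and minv = -m^-1 mod 2^32 must be precomputed on the host):

    // Montgomery multiplication (REDC): computes a*b*2^-32 mod m without division.
    // Assumes m odd, m < 2^31 (so the 64-bit accumulation cannot overflow),
    // a, b < m, and minv = -m^-1 mod 2^32 precomputed.
    __device__ unsigned int mont_mul(unsigned int a, unsigned int b,
                                     unsigned int m, unsigned int minv)
    {
        unsigned long long t = (unsigned long long)a * b;
        unsigned int q = (unsigned int)t * minv;              // q = -t*m^-1 mod 2^32
        unsigned long long u = (t + (unsigned long long)q * m) >> 32;
        return (u >= m) ? (unsigned int)(u - m) : (unsigned int)u;
    }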

I asked because I need to write down a mathematical model that estimates the GPU speed-up for a specific algorithm (RNS Montgomery Exponentiation).
So far I would like to use Amdahl's Law (strong scaling), which gives the maximum theoretical speed-up. The problem is comparing the CPU workload with the GPU workload.
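For reference, the strong-scaling form I mean is

    S(N) = 1 / ((1 - p) + p / N)

where p is the parallelizable fraction of the runtime and N is the number of processors.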

Suppose I have k modular multiplications; if k = 34, for example, on the CPU I will have 34 × (32-bit multiplication + 64-bit modulo).
How could I estimate this on a compute capability 2.0 GPU? I thought of doing the following: the first warp is fully populated, so if the throughput is 16 32-bit integer multiplications per clock cycle per SM, we spend 2 cycles executing the first 32 multiplications plus another cycle for the remaining 2 (not taking the modulo into account yet), as in the sketch below.
Is this correct? Or is there a better way to estimate it?
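A sketch of the counting I have in mind (the constants and the ceiling arithmetic are my assumptions; in particular I am unsure whether a partially filled warp rounds up to the full issue time):

    #include <stdio.h>

    // Back-of-the-envelope cycle count for k 32-bit multiplications on one SM,
    // assuming a throughput of 16 muls per clock per SM (compute capability 2.0)
    // and one multiplication per thread.
    int main(void)
    {
        const int k = 34;           // number of multiplications
        const int warp_size = 32;
        const int muls_per_clock = 16;

        int full_warps = k / warp_size;                              // 1 full warp
        int tail = k % warp_size;                                    // 2 leftover muls
        int cycles = full_warps * (warp_size / muls_per_clock)       // 2 cycles
                   + (tail ? (tail + muls_per_clock - 1) / muls_per_clock : 0);  // +1
        printf("estimated cycles: %d\n", cycles);                    // prints 3
        return 0;
    }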
Moreover, if the throughput is 16 integer multiplications per clock per MP, is it correct to say that the i-th warp, needing to execute at least 17 multiplications, will keep the MP busy for 2 cycles, even if other warps are ready to be served?