Benchmarking for Peak Instruction Throughput example from Volkov's paper

Hey there,

I’m trying to replicate the peak register-to-register MAD throughput from Volkov’s paper (section 3.6).

The paper reports reaching 98% of peak with:

  1. 1 vector thread per core

  2. 64 elements in each vector

  3. each thread performs a group of 6 MADs (looped a million times)

I’m a little lost on how to begin the setup…

If I have 14 MPs, do I need a total of 14*64 = 896 elements to distribute? And if there is only 1 vector, do all 3 operands for the MAD come from that 1 vector? Do you launch 14 blocks with 8 threads per block to get 1 thread per core?

Here is the excerpt:

We were able to achieve 98% of the arithmetic peak in register-to-register multiply-and-add instructions. This was achieved running a single vector thread per core. In the test, each thread performs a group of 6 independent multiply-and-adds a million times in an aggressively unrolled loop. This is designed to hide the pipeline latency even at a small number of threads per core. The smallest vector length that yielded so high a fraction of peak was 64 elements, i.e. two warps. We couldn’t achieve comparable rate with shorter vectors even when running many vector threads per core.
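
For what it’s worth, here is a minimal sketch of how I currently read that setup. It is not Volkov’s actual benchmark. It assumes that a "vector thread" corresponds to one thread block, that a 64-element vector corresponds to 64 CUDA threads (two warps, per the excerpt), and that launching as many blocks as there are MPs places one block on each multiprocessor, which the scheduler does not strictly guarantee. The kernel name madThroughput, the constants, and the operand values 0.5 and 1.0 are all made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// All names and values below are hypothetical, chosen to mirror the excerpt.
constexpr int kThreadsPerBlock = 64;       // one 64-element "vector" = two warps
constexpr int kIterations      = 1000000;  // "a million times"
constexpr int kMadsPerIter     = 6;        // group of 6 independent MADs

__global__ void madThroughput(float *out, float b, float c)
{
    // Six independent accumulators, private to each thread (registers).
    float a0 = threadIdx.x;
    float a1 = a0 + 1.0f, a2 = a0 + 2.0f, a3 = a0 + 3.0f;
    float a4 = a0 + 4.0f, a5 = a0 + 5.0f;

    // b and c arrive as kernel arguments so the compiler cannot fold the math.
    #pragma unroll 16
    for (int i = 0; i < kIterations; ++i) {
        a0 = a0 * b + c;
        a1 = a1 * b + c;
        a2 = a2 * b + c;
        a3 = a3 * b + c;
        a4 = a4 * b + c;
        a5 = a5 * b + c;
    }

    // Store the result so the loop is not removed as dead code.
    out[blockIdx.x * blockDim.x + threadIdx.x] = a0 + a1 + a2 + a3 + a4 + a5;
}

int main()
{
    int numMPs = 0, clockKHz = 0;
    cudaDeviceGetAttribute(&numMPs, cudaDevAttrMultiProcessorCount, 0);
    cudaDeviceGetAttribute(&clockKHz, cudaDevAttrClockRate, 0);

    float *out = nullptr;
    cudaMalloc(&out, numMPs * kThreadsPerBlock * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch, then the timed launch: one block per MP, 64 threads each.
    madThroughput<<<numMPs, kThreadsPerBlock>>>(out, 0.5f, 1.0f);
    cudaEventRecord(start);
    madThroughput<<<numMPs, kThreadsPerBlock>>>(out, 0.5f, 1.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Each MAD counts as 2 flops (one multiply, one add).
    double totalMads = (double)numMPs * kThreadsPerBlock * kIterations * kMadsPerIter;
    double flopsPerSec = 2.0 * totalMads / (ms * 1e-3);
    double flopsPerClockPerMP = flopsPerSec / (clockKHz * 1e3) / numMPs;

    printf("MPs: %d, threads: %d\n", numMPs, numMPs * kThreadsPerBlock);
    printf("time: %.3f ms, %.2f GFLOP/s\n", ms, flopsPerSec / 1e9);
    printf("flops per clock per MP: %.2f\n", flopsPerClockPerMP);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return 0;
}
```

Under this reading, a 14-MP card would run 14 blocks of 64 threads (896 threads total), and each thread would keep its own MAD operands in registers rather than pulling them from a shared vector. The last printed number has to be compared against whatever the peak flops per clock per MP is for the particular card, which I have not hard-coded here.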