Hey there,
I’m trying to replicate the peak register-to-register MAD throughput from Volkov’s paper (section 3.6).
The paper reports 98% of peak using:

1 vector thread per core

64 elements in each vector

each thread performs a group of 6 MADs (looped a million times)
I’m a little lost on how to begin the setup…
If I have 14 MPs, do I need a total of 14*64 = 896 elements to distribute? But if there is only 1 vector, do all 3 operands for the MAD come from that 1 vector? Do you assign 14 blocks and 8 threads per block to obtain 1 thread per core?
Here is the excerpt:
We were able to achieve 98% of the arithmetic peak in register-
to-register multiply-and-add instructions. This was achieved
running a single vector thread per core. In the test, each thread
performs a group of 6 independent multiply-and-adds a million
times in an aggressively unrolled loop. This is designed to hide
the pipeline latency even at a small number of threads per core.
The smallest vector length that yielded so high a fraction of
peak was 64 elements, i.e. two warps. We couldn’t achieve
comparable rate with shorter vectors even when running many
vector threads per core.
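
For reference, here is one possible reading of that excerpt as a CUDA sketch. It is not the paper's code. It assumes that a "vector thread" corresponds to one thread block, that a "core" is one multiprocessor, and that the 64 "elements" are the 64 CUDA threads of the block (the excerpt's "two warps"). Under that reading the launch would be one block of 64 threads per MP, and every operand of a MAD lives in the registers of a single CUDA thread. The kernel name, the `UNROLL` factor, the constants `a` and `b`, and the `total_flops` helper are all made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MADS_PER_GROUP 6   // from the excerpt: a group of 6 independent MADs
#define UNROLL 16          // assumption: the paper only says "aggressively unrolled"

// Flops executed by the whole grid, counting each MAD as 2 flops.
static unsigned long long total_flops(int blocks, int threads, int iters) {
    return 2ULL * MADS_PER_GROUP * blocks * threads * iters;
}

// Each CUDA thread keeps 6 independent accumulators in registers and
// applies r = r * a + b to each of them `iters` times.
// `iters` is assumed to be a multiple of UNROLL.
__global__ void mad_kernel(float *out, float a, float b, int iters) {
    float r0 = threadIdx.x, r1 = r0 + 1.0f, r2 = r0 + 2.0f;
    float r3 = r0 + 3.0f,   r4 = r0 + 4.0f, r5 = r0 + 5.0f;
    for (int i = 0; i < iters; i += UNROLL) {
        #pragma unroll
        for (int j = 0; j < UNROLL; ++j) {
            r0 = r0 * a + b;  r1 = r1 * a + b;  r2 = r2 * a + b;
            r3 = r3 * a + b;  r4 = r4 * a + b;  r5 = r5 * a + b;
        }
    }
    // Store the result so the compiler cannot discard the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] = r0 + r1 + r2 + r3 + r4 + r5;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocks  = prop.multiProcessorCount; // one block per MP (e.g. 14)
    int threads = 64;                       // vector length from the excerpt
    int iters   = 1000000;                  // "a million times"

    float *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    mad_kernel<<<blocks, threads>>>(d_out, 0.999f, 0.001f, UNROLL); // warm-up
    cudaEventRecord(start);
    mad_kernel<<<blocks, threads>>>(d_out, 0.999f, 0.001f, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.1f Gflop/s\n", total_flops(blocks, threads, iters) / (ms * 1e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

To get a fraction of peak, the printed rate would still have to be divided by the card's theoretical MAD peak, which depends on its clock rate and is not computed here. Whether the hardware schedules exactly one block per MP is also an assumption of this sketch.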