Apologies if this has been answered before but I couldn’t find a related post using the search.
I am trying to determine what the cost of SMEM to REG loads is.
I am trying to do a simple 2d convolution with both my image tile and my kernel in SMEM, and then prior to each MADD in the convolution I load one of the operands into a REG.
This is confirmed in decuda where my unrolled loop is full of mov, mad, mov, mad and so on.
Volkov and Demmel’s paper ‘Benchmarking GPUs to Tune Dense Linear Algebra’ performs cycle counting using decuda output. Doing the same, I have assumed that each mov takes 4 cycles and each mad takes 6 cycles per warp (as in the paper), and that each mad performs 2 floating point operations whereas the mov performs none.
If my loop were completely made out of mov and mad instructions then every 10 cycles the warp would perform 32*2 = 64 floating point ops.
A single multiprocessor (i.e. 8 ALUs) could perform a maximum of 10 cycles * 8 ALUs * 2 ops = 160 floating point ops in this time (assuming 2 ops per cycle, i.e. no extra MUL).
This would be 64/160 = 40% of peak performance (ignoring memory bottlenecks and other issues).
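The arithmetic above can be checked with a few lines of Python. This is only a sketch of the cycle-counting estimate: the function and parameter names are made up for illustration, and the 4-cycle mov and 6-cycle mad figures are the assumed values taken from the paper, not something measured here.

```python
def fraction_of_peak(cycles_per_pair, warp_size=32, alus=8, flops_per_mad=2):
    """Estimate achieved/peak flops for one mov+mad pair issued per warp.

    cycles_per_pair is an assumed cost (hypothetical parameter), not a measurement.
    """
    # One mad per thread in the warp, 2 flops each; the mov contributes no flops.
    achieved = warp_size * flops_per_mad
    # Peak: every ALU retires one mad (2 flops) on every cycle of that window.
    peak = cycles_per_pair * alus * flops_per_mad
    return achieved / peak

MOV_CYCLES = 4  # assumed, per warp
MAD_CYCLES = 6  # assumed, per warp

serial = fraction_of_peak(MOV_CYCLES + MAD_CYCLES)
print(f"mov then mad, no overlap: {serial:.0%} of peak")
```

With the assumed costs this gives 64 flops out of a possible 160, i.e. the 40% figure.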
In the paper by Volkov and Demmel they suggest that the performance would be higher if the memory loads were done in parallel.
Are the memory loads done in parallel on the GPU or are they performed by the ALUs?
As far as I know, SMEM to REG moves can be performed in the SFU pipeline in parallel with MAD on GT200 processors (GTX280, etc.). This is done, in part, to compensate for the lack of a double precision MAD instruction (=FMA) with a shared memory operand. This explains why DGEMM on the GPU runs at nearly peak arithmetic throughput: all SMEM reads are done in parallel.
However, I have never tried to benchmark single precision MAD with a shared memory operand running in parallel with SMEM-to-REG MOV, so I can’t tell how well the MOV will be hidden behind the MAD. Ideally, it would be hidden completely and a MOV/MAD pair would run in 6 cycles instead of 6+4=10, but there could be other bottlenecks that I’m not aware of.
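To put rough numbers on the two cases, here is a small Python sketch comparing a serialized MOV/MAD pair with the ideal case where the MOV is completely hidden. The names are hypothetical, the 4 and 6 cycle costs are the assumed per-warp figures quoted in this thread, and the overlapped case is the best-case guess described above, not a benchmark result.

```python
WARP_SIZE = 32      # threads per warp
ALUS = 8            # scalar ALUs per multiprocessor
FLOPS_PER_MAD = 2   # multiply + add
MOV_CYCLES = 4      # assumed per-warp cost of an SMEM-to-REG mov
MAD_CYCLES = 6      # assumed per-warp cost of a mad

def fraction_of_peak(cycles_per_pair):
    # Flops done by one warp-wide mad, divided by what the ALUs
    # could do at one mad per ALU per cycle over the same window.
    achieved = WARP_SIZE * FLOPS_PER_MAD
    peak = cycles_per_pair * ALUS * FLOPS_PER_MAD
    return achieved / peak

# Case 1: mov and mad issue back to back, costs add up.
serialized = fraction_of_peak(MOV_CYCLES + MAD_CYCLES)
# Case 2 (idealized): mov runs in another pipeline, only the longer op counts.
overlapped = fraction_of_peak(max(MOV_CYCLES, MAD_CYCLES))

print(f"serialized: {serialized:.1%} of peak")
print(f"overlapped: {overlapped:.1%} of peak")
```

Under these assumptions, hiding the MOV would lift the estimate from 64/160 to 64/96 of peak; any extra bottleneck would land somewhere in between or lower.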
So my understanding from what I have read and what you have stated here is that the GT200 series performs MAD (double precision) in parallel with MOV but the 8 series does not do MAD (single precision) in parallel with MOV?