How can I incorporate register caching and shuffling in my kernel?

The problem, roughly speaking, is described here. The A tile is readily amenable to replacement by shuffling because the threads in the warp (for both A and B tiles) load values horizontally, i.e. row-wise across the warp, and the values needed from the A matrix during the for-loop are arranged along a row of A. The values to be exchanged are therefore held by adjacent threads in the same warp, which is exactly what a shuffle can reach.

But for the B tile, the values to be exchanged are columnar, and a shuffle only exchanges data between threads of the same warp. You can’t get at them with a shuffle op without redesigning the load pattern for the B tile. That redesign could be done by having warps load vertically rather than horizontally, but it would convert a nicely coalesced load pattern in global memory into an uncoalesced one. That is unlikely to be a performance win, so I personally didn’t invest any time in it.
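To make the asymmetry concrete, here is a minimal sketch rather than the original kernel. It assumes a 32x32 thread block, so each warp covers exactly one tile row, and square row-major matrices whose size n is a multiple of 32. The names `matmul_shfl_a` and `shuffle_matmul_matches_cpu` are made up for illustration. The A value is kept in a register and broadcast with `__shfl_sync`, while the B tile stays in shared memory because its values are loaded by other warps.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // assumption: tile width == warp size

// Hypothetical kernel: A tile exchanged by warp shuffle, B tile via shared memory.
__global__ void matmul_shfl_a(const float* A, const float* B, float* C, int n)
{
    __shared__ float Bs[TILE][TILE];
    const int lane = threadIdx.x;  // with a 32x32 block, threadIdx.x is the lane id
    const int row  = blockIdx.y * TILE + threadIdx.y;
    const int col  = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Row-wise (coalesced) loads: each warp reads one row segment of A and of B.
        float a_reg = A[row * n + t * TILE + lane];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k) {
            // A[row][t*TILE + k] sits in lane k of this same warp: broadcast it.
            float a = __shfl_sync(0xffffffffu, a_reg, k);
            // B[t*TILE + k][col] was loaded by a thread in another warp,
            // so it has to come from shared memory.
            acc += a * Bs[k][lane];
        }
        __syncthreads();
    }
    C[row * n + col] = acc;
}

// Host check against a CPU reference; inputs are small integers so sums are exact.
bool shuffle_matmul_matches_cpu(int n)
{
    if (n <= 0 || n % TILE != 0) return false;
    const size_t count = size_t(n) * n;
    const size_t bytes = count * sizeof(float);
    std::vector<float> hA(count), hB(count), hC(count, 0.0f);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            hA[i * n + j] = float((i + j) % 7);
            hB[i * n + j] = float((3 * i + j) % 5);
        }

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess &&
              cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dC, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
        dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
        matmul_shfl_a<<<grid, block>>>(dA, dB, dC, n);
        ok = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    for (int i = 0; ok && i < n; ++i)
        for (int j = 0; ok && j < n; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[i * n + k] * hB[k * n + j];
            if (hC[i * n + j] != ref) ok = false;
        }
    return ok;
}

int main()
{
    const bool ok = shuffle_matmul_matches_cpu(64);
    std::printf("%s\n", ok ? "match" : "mismatch");
    return ok ? 0 : 1;
}
```

Doing the same for B would mean each warp loading a column of the B tile, so that consecutive lanes read addresses n floats apart. That is the uncoalesced pattern mentioned above.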
