Warp shuffle instruction not working as expected

In CUDA, we generally prefer that adjacent threads in a warp load adjacent data from memory, for performance reasons. This falls under the general topic of memory coalescing.
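To illustrate (a minimal sketch; the kernel and parameter names are my own invention): in a coalesced pattern, `threadIdx.x`, the fastest-varying thread index within a warp, indexes the fastest-varying (column) dimension of a row-major matrix:

```cuda
// Sketch: adjacent threads read adjacent elements of a row, so each
// warp's 32 loads coalesce into as few memory transactions as possible.
__global__ void coalesced_row_read(const float *A, float *out, int ncols)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // threadIdx.x varies fastest across the warp, and it indexes the
    // fastest-varying (column) dimension of the row-major matrix A:
    out[row * ncols + col] = A[row * ncols + col];
}
```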

For reasonably sized problems, this means that adjacent threads will read along a row of a 2D matrix (as it is normally laid out in C++). We already know from your previous matrix-multiply codes that, when dealing with the A matrix in the equation C = AxB, a thread does indeed need to index along a row in order to retrieve the elements of A that it needs to compute a single output element (i.e. to compute a single vector dot product). This might be amenable, in a limited way, to using warp shuffle to exchange A “vector data” between threads, perhaps eliminating the need for tile storage for A such as a shared memory tile.
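A rough sketch of that idea (the kernel name, the 32-wide tiling, and the one-warp-per-row launch geometry are my own assumptions, not from any posted code): each thread performs a coalesced load of one element of a row of A, and the dot-product loop then obtains the remaining row elements by warp-wide broadcast with `__shfl_sync`, with no shared-memory tile for A:

```cuda
// Sketch: one warp computes 32 adjacent elements of one row of C.
// Assumes blockDim.x == 32 and n divisible by 32.
__global__ void a_row_via_shuffle(const float *A, const float *B,
                                  float *C, int n)
{
    int lane = threadIdx.x % 32;          // lane within the warp
    int row  = blockIdx.y;                // this warp's row of C
    int col  = blockIdx.x * 32 + lane;    // this thread's column of C

    float sum = 0.0f;
    for (int k0 = 0; k0 < n; k0 += 32) {
        float a = A[row * n + k0 + lane]; // coalesced: 32 A row elements
        for (int k = 0; k < 32; k++) {
            // broadcast lane k's A element to every lane in the warp
            float a_k = __shfl_sync(0xFFFFFFFFu, a, k);
            sum += a_k * B[(k0 + k) * n + col]; // B read is also coalesced
        }
    }
    C[row * n + col] = sum;
}
```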

However, for the B matrix (or tile), each thread needs to collect values from a column vector in B, i.e. each thread needs to index along a column. If we stick to our desire to load in a coalesced fashion, warp shuffle does not immediately offer a way (as it does for the A matrix) to usefully exchange B vector elements between threads in a warp. The threads in a warp would be reading row vectors, not column vectors.

It might be possible to use a methodology similar to a register-based transpose to address this aspect of using values from the B tile, but it would reduce us from using the full complement of threads in the threadblock to using only a single “row” of threads. That doesn’t look very attractive to me, at first glance.
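For what it's worth, here is roughly what I have in mind (a hypothetical sketch; all names are mine): a single warp loads a 32x32 B tile row by row, each load coalesced, after which thread t holds all of column t of the tile in its registers. Note that only 32 threads are doing the work:

```cuda
// Sketch: one warp (32 threads) covers a 32x32 tile of B.
// At loop iteration r, every thread reads one element of row r
// (coalesced); after 32 iterations, thread t holds column t of the
// tile in registers. Assumes n is the row stride of B.
__global__ void b_column_in_registers(const float *B, float *col_sums, int n)
{
    int t = threadIdx.x;          // lane 0..31
    float bcol[32];               // thread t accumulates column t here
    for (int r = 0; r < 32; r++)
        bcol[r] = B[r * n + t];   // coalesced row-wise load

    // Trivial use of the column vector, just to show each thread now
    // owns one full column of the tile:
    float s = 0.0f;
    for (int r = 0; r < 32; r++) s += bcol[r];
    col_sums[t] = s;
}
```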

But if you did that, you could probably eliminate the use of shared memory altogether for, say, 32x32 tiles, using only a single warp of 32 threads. This would address your items 1 and 3. Item 2 doesn’t make much sense to me; it is not something I would recommend for sensible or performant CUDA code, so I personally would not spend any time on it, even though it is basically trivial (and you have already implemented it, anyway).

For the case of your “toy” problem with 2x2 threadblocks, all the threads in that tile/threadblock would belong to a single warp, so it should be possible to come up with a shuffling pattern that would allow you to eliminate the use of shared memory tiles there.
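Here is one such pattern as an untested sketch (all names are my own; it assumes a single 2x2 threadblock and row-major 2x2 matrices): each thread loads one element of A and one element of B into registers, and obtains everything else it needs via `__shfl_sync`, with no shared memory at all:

```cuda
// Sketch for the 2x2 toy case: a 2x2 threadblock (4 threads, all in
// one warp). Lane index is ty*2 + tx, so A[ty][k] lives in lane
// ty*2 + k and B[k][tx] lives in lane k*2 + tx.
__global__ void matmul_2x2_shuffle(const float *A, const float *B, float *C)
{
    int tx = threadIdx.x, ty = threadIdx.y;   // 2x2 block

    float a = A[ty * 2 + tx];                 // this thread's A element
    float b = B[ty * 2 + tx];                 // this thread's B element

    float sum = 0.0f;
    for (int k = 0; k < 2; k++) {
        // fetch A[ty][k] and B[k][tx] from the lanes that hold them
        float a_k = __shfl_sync(0xF, a, ty * 2 + k);
        float b_k = __shfl_sync(0xF, b, k * 2 + tx);
        sum += a_k * b_k;
    }
    C[ty * 2 + tx] = sum;                     // C[ty][tx]
}
```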

If time permits I may try to write some code, but I don’t have anything further to share at this time. Working on a 2x2 problem is useful if the concepts employed are readily extensible to larger sizes, but they would not be directly extensible beyond a 4x4 threadblock size (a 4x4 block has 16 threads; the next power-of-two size, 8x8, has 64 threads spanning multiple warps, and shuffle cannot exchange data across warps), which isn’t interesting to me, and I doubt it would be interesting from a performance perspective.