How could warp shuffling be useful in matrix multiplication except for loading data to and from register variables?

A common answer found on the Internet to the question "What is warp shuffling?" is:

Warp shuffling is a technique to exchange values between two threads in a warp.

This definition is confusing at best, as it is impossible to visualize.

What I understood from the texts available on the Internet is:

  • Warp shuffling is a technique to load data from a kernel argument (an array, a vector, or a matrix) into a register variable, thereby eliminating the need for shared memory; registers are considerably faster than device memory.

Also,

  • There are some API functions in CUDA related to warp shuffling that allow us to do the following (a short sketch of all three appears after this list):

    • load data from one variable to multiple variables (called broadcasting)
    • shift values from left to right or right to left (called shifting)
    • exchange values in a cross-fashion (called butterfly exchange)
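
For reference, here is a minimal sketch of those three variants using the __shfl_sync family of intrinsics (the kernel name is made up for illustration, and a full-warp mask on a device of compute capability 3.0 or newer is assumed):

    #include <cstdio>

    __global__ void shuffle_demo()
    {
        const unsigned full_mask = 0xffffffffu;  // all 32 lanes participate
        int lane = threadIdx.x % 32;             // lane index within the warp
        int val  = lane;                         // register value private to this thread

        // Broadcasting: every lane reads the value held by lane 0.
        int b = __shfl_sync(full_mask, val, 0);

        // Shifting: each lane reads the value held by the lane one position
        // below it (lane 0 has no lane below it and keeps its own value).
        int s = __shfl_up_sync(full_mask, val, 1);

        // Butterfly exchange: lanes paired by XOR-ing the lane index with 1
        // swap their values (0<->1, 2<->3, ...).
        int x = __shfl_xor_sync(full_mask, val, 1);

        printf("lane %2d: broadcast=%d shift=%d butterfly=%d\n", lane, b, s, x);
    }

    int main()
    {
        shuffle_demo<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }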

However, I don’t see the utility of warp shuffling in matrix multiplication except for loading data to and from register variables.

How could warp shuffling be useful in matrix multiplication except for loading data to and from register variables?

No, it isn’t. Warp shuffling is a technique to exchange data from one thread-local variable to another (formally: register in one thread to a register in another thread in the warp, at the SASS level). If you want, of course you can load data from a kernel argument (into a register) but that is separate from the warp-shuffle operation.
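
To make that distinction concrete, here is a minimal sketch (the kernel name and the in/out arguments are hypothetical, and the block is assumed to contain exactly one full warp). The load from the kernel argument into a register and the shuffle are two separate operations, and only the first one touches memory:

    __global__ void load_then_shuffle(const float *in, float *out)
    {
        const unsigned full_mask = 0xffffffffu;
        int lane = threadIdx.x % 32;

        // Step 1: an ordinary load from global memory into a per-thread register.
        float r = in[threadIdx.x];

        // Step 2: the warp shuffle itself -- read the register value currently
        // held by lane (lane + 1) % 32. No shared or global memory is involved.
        float neighbor = __shfl_sync(full_mask, r, (lane + 1) % 32);

        out[threadIdx.x] = neighbor;
    }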

Sorry, I am not following. Draw one row of contiguous boxes. Below it, draw a second row of contiguous boxes. Now connect each box from the first row with an arrow to the appropriate box in the second row based on the specifics of the shuffle operation. Voila, we have visualized the shuffle operation. FWIW, there are plenty of such illustrations on the internet. Here is one example at the NVIDIA developer blog:
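
Independently of that illustration, the same two-row picture can be sketched in code (assuming a single full warp): the comment below draws the mapping for __shfl_down_sync with a delta of 1, and the kernel prints it.

    #include <cstdio>

    // value held by lane:    0   1   2   3  ...  30  31
    // result in that lane:   1   2   3   4  ...  31  31
    //
    // Each lane's result comes from the lane one position above it; the
    // last lane has no lane above it and keeps its own value.
    __global__ void shift_down_demo()
    {
        int lane = threadIdx.x % 32;
        int val  = lane;                                   // "first row"
        int res  = __shfl_down_sync(0xffffffffu, val, 1);  // "second row"
        printf("lane %2d receives %2d\n", lane, res);
    }

    int main()
    {
        shift_down_demo<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }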

You missed the main question. :)

This is a discussion forum. It is not a Q+A site like Stack Overflow. Therefore it is perfectly legitimate for members of the community to comment on any aspect of your posting, down to a single statement or idea, without necessarily addressing your posting as a whole or providing a whole answer.

If you don’t want someone to comment on some aspect of your posting, don’t include that aspect in your posting.

Just because someone did not respond to some aspect of your posting does not mean they missed anything.

You may be able to work out the answer to the main question yourself after correcting the misconceptions stated in the first half of the original post. I would encourage some hands-on experimentation accompanied by reading relevant documentation to get a solid grasp of particular CUDA features before tackling bigger questions. That works well in my experience.