Matrix Multiplication Using Register Caches and __shfl?

I’m currently working on matrix-matrix multiplication in CUDA and learned that register-level tiling can be used for high-performance matrix-matrix multiplication.
It’s claimed to be even faster than shared-memory-based multiplication algorithms.
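
To illustrate what I mean, here is a toy sketch I put together (my own illustrative code, not from any real library: the TM/TN sizes are arbitrary, there is no shared-memory staging, and it assumes the matrix dimensions divide the tile grid evenly):

```cuda
#define TM 4
#define TN 4

// Each thread accumulates a TM x TN micro-tile of C entirely in registers.
// A real kernel would also stage A/B tiles through shared memory; this only
// shows the register-tiling part.
__global__ void sgemm_reg_tile(int M, int N, int K,
                               const float *A, const float *B, float *C)
{
    int row0 = (blockIdx.y * blockDim.y + threadIdx.y) * TM;
    int col0 = (blockIdx.x * blockDim.x + threadIdx.x) * TN;

    float acc[TM][TN] = {};              // accumulators live in registers
    for (int k = 0; k < K; ++k) {
        float a[TM], b[TN];              // per-thread register "cache"
        for (int i = 0; i < TM; ++i) a[i] = A[(row0 + i) * K + k];
        for (int j = 0; j < TN; ++j) b[j] = B[k * N + col0 + j];
        for (int i = 0; i < TM; ++i)     // TM*TN FMAs per loaded element
            for (int j = 0; j < TN; ++j)
                acc[i][j] += a[i] * b[j];
    }
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            C[(row0 + i) * N + col0 + j] = acc[i][j];
}
```

But I assume the real high-performance versions need much more care than this, which is why I’m asking for samples.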

Is there any sample code for it that I can look at online?

Here is a good writeup.

Thank you so much!

Is there any sample code online I can look at? I found samples for shmem, but not for register-level tiling.

There is “representative” code (a mixture of C++ and PTX) in the article I already linked.
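
The core trick in that writeup is using warp shuffle as a “register cache”: each lane keeps part of the data in its own registers, and the other lanes fetch it with __shfl_sync instead of going through shared memory. A stripped-down sketch of the idea (my own toy matrix-vector example, not the article’s code; launch it as matvec32<<<1, 32>>>):

```cuda
// One warp computes y = A * x for a 32x32 matrix A. Lane k holds x[k]
// in a register; every lane reads x[k] via __shfl_sync, so the vector x
// never touches shared memory after the initial load.
__global__ void matvec32(const float *A, const float *x, float *y)
{
    int lane = threadIdx.x;                          // 0..31, a single warp
    float xr = x[lane];                              // the "register cache" entry
    float acc = 0.0f;
    for (int k = 0; k < 32; ++k) {
        float xk = __shfl_sync(0xFFFFFFFFu, xr, k);  // broadcast from lane k
        acc += A[lane * 32 + k] * xk;
    }
    y[lane] = acc;
}
```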

To get the level of control needed for their objectives, Nervana/Scott Gray started by creating their own assembler. The SGEMM sample code associated with that effort is here (AFAIK), but the guts of it are not in CUDA C++; they are in SASS (GPU assembly) code, for the reason already stated: control.

Careful control of register usage is not something you can specify in CUDA, using either the C++ or PTX bindings (if you want to argue the PTX case, fine; I won’t take up that argument). And the NVIDIA-provided toolchains don’t give you any ability to do SASS-level code development (i.e. assembly-language development). So if you’re looking for a sample code in CUDA C++ that does this kind of register control, you simply won’t find it.
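
The most you get from CUDA C++ is indirect pressure on the register count, e.g. __launch_bounds__ (or the -maxrregcount nvcc flag). Those bound how many registers the compiler may use per thread, but say nothing about which registers, their banking, or their reuse; a small sketch:

```cuda
// __launch_bounds__ constrains the register *count* (indirectly, via the
// stated occupancy goal); it cannot place values in specific registers or
// manage bank conflicts/reuse the way SASS-level work can.
__global__ void __launch_bounds__(256, 2)  // <=256 threads/block, >=2 blocks/SM
scale(float *out, const float *in, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * s;
}
```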

Also note that GPU designs are starting to dedicate some of their silicon “budget” to TensorCores, and if you can use them, the “fast path” for matrix-matrix multiply will almost certainly be through TensorCores. The latest Ampere TensorCore units support 16-bit, 19-bit (TF32), and 64-bit (and other) accelerated matrix-multiply paths. These are not programmed in a fashion similar to the way the “usual” matrix multiply is done (e.g. what Scott Gray did). Instead, the best reference I can suggest there is CUTLASS. You will also find non-TensorCore paths in CUTLASS, so if you want to study high-quality work, it is a good resource for that as well. None of it is in PTX or SASS, AFAIK.
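
For a flavor of how different the TensorCore model is, here is a minimal sketch using the warp-level WMMA API from <mma.h> (the primitive CUTLASS builds on): a single warp computes one 16x16x16 half-precision tile, launched as wmma_tile<<<1, 32>>>. Everything here is illustrative, not CUTLASS code:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively computes D = A*B for a single 16x16x16 tile.
// The fragment register layouts are opaque to the programmer -- nothing
// like the per-thread accumulator loops of a classic SGEMM kernel.
__global__ void wmma_tile(const half *A, const half *B, float *D)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);   // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}
```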