Matrix Multiplication using __shfl?

Seems like Matrix Multiplication would be a good application of __shfl instructions, but I’ve been unable to find example code.

Has anyone seen any example code?

For broadcasting your outer product rows out among the threads, shared float4 (LDS.128) loads have double the throughput. Plus you can leverage shared memory as a double buffer: load one set of rows from device memory, while computing on the previously loaded rows.
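The double-buffering idea above can be sketched roughly like this (a pseudocode-level outline in CUDA style; TILE_K, TILE_M, loadTile, and computeOnTile are hypothetical placeholders, not code from this thread):

```cuda
// Hypothetical outline: two shared-memory tile buffers ping-pong, so the
// loads for the next K-slice overlap with the math on the current one.
__shared__ float4 tile[2][TILE_K][TILE_M / 4];  // float4 => LDS.128 accesses

loadTile(tile[0], A, 0);            // prologue: fetch the first slice
__syncthreads();

int cur = 0;
for (int k = TILE_K; k < K; k += TILE_K) {
    loadTile(tile[1 - cur], A, k);  // start fetching the next slice
    computeOnTile(tile[cur]);       // rank-TILE_K update on the current slice
    __syncthreads();                // both must finish before swapping
    cur = 1 - cur;
}
computeOnTile(tile[cur]);           // epilogue: last slice
```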

Have you seen example code online?

The CUDA C Programming Guide has a pretty simple shared memory implementation of MM. I document the more advanced approach that cublas uses here:

Then you can find my current open source implementations of gemm here:

Here are some independent benchmarks:
These are generally outperforming cublas, particularly when one of your dimensions is small (like a minibatch in machine learning). I’m working on a new set now that are faster still and getting very high utilization with small tiles (like 16x64 or 32x32) and low occupancy.

I’ve implemented templated matrix classes supporting up to 32x32 size, complex valued. Matrix columns are distributed across the threads of a warp. All loops are fully unrolled and hence the entire matrix stays in registers (unless register pressure is too high). We can do multiplication and inversion, and yes we use a lot of warp shuffles.

A warp can hold one 32 wide matrix, or multiple smaller ones. Due to restrictions in the shuffle instruction, the matrix stride (spacing between matrices) has to be a power of 2. So any non power of 2 matrices will require some padding.

The biggest problem I have not been able to solve to my satisfaction is transposition. We’ve had to use shared memory because I could not find a way to use warp shuffles efficiently for a matrix transpose. A thread describing my conundrum is here:

Small matrices still have a lot of applications in engineering, such as signal processing for antenna systems. And being able to do all of this without global memory access allows us to get excellent performance (the arithmetic code is compute bound, not memory bound).



I’m curious what your typical matrix dimensions are, and what your compute pipeline looks like. I’m not quite understanding your lack of need for global memory access. Obviously you need to load the problem into the registers initially from global memory, but isn’t it more efficient to be doing that a little at a time as you’re computing your outer products?

I’m starting to get really good now at small tile gemm implementations, and I wonder if some of my techniques might map well to your problem. The GPU has limited local state and I find it best to leverage it in all its forms instead of just one. This means trying to max out L2, L1, shared memory and registers, and even the register reuse cache.


The main motivation for my choice of data layout was reducing register pressure, especially for the Kepler type GPUs with their low register limit of 63 per thread. We found that for matrices sized 8x8 and beyond, register pressure was killing our performance: when all data was held in a single thread, a single complex 8x8 matrix required 128 registers. And during inversion a lot more temporary storage was needed.

Our newer computation kernels still perform a lot of memory access, but the matrix arithmetic code is devoid of memory access (unless register spillage occurs, where some of that data will hit the L1/L2 caches). Another requirement for us is that we have to work on a lot of these small matrices in parallel, simulating multiple adjacent frequency bands or channels.

I am not really trying to advertise our approach here. I only replied because OP robosmith was asking about using warp shuffles in matrix multiplication. Here’s some juicy code demonstrating this (an actual part of the class). CUDA 7.0 is required (with C++11 mode enabled: we need lambdas).

The code below should also work when replacing singlecomplex with float. We also have a version compatible with CUDA 5.0, which uses functors instead of lambda expressions - but that makes the code very awkward to read and maintain.

#define REF(x) &x
#define M_DEVICE __device__
#define DEVICE __device__

// loop unrolling code allowing template arguments used for unroll counts
template< int Begin, int End, int Step = 1 >
struct Unroller {
    template<typename Action>
    M_DEVICE __forceinline__ static void step(Action &action) {
        action(Begin);
        Unroller<Begin+Step, End, Step>::step(action);
    }
};

// termination of the recursion once Begin reaches End
template< int End, int Step >
struct Unroller<End, End, Step> {
    template<typename Action>
    M_DEVICE static void step(Action &action) {}
};

// warp shuffle overload taking singlecomplex (actually a wrapper around a float2 type)
DEVICE singlecomplex __shfl(const singlecomplex &sc, int lane, int width)
{
    singlecomplex result;
    result.value.x = __shfl(sc.value.x, lane, width);
    result.value.y = __shfl(sc.value.y, lane, width);
    return result;
}

// compile-time check (C++11 static_assert) that the matrix stride is a power of 2
template< int S >
struct is_power_of_two {
    static_assert((S > 1) & !(S & (S - 1)), "You must use a power of 2 as matrix stride parameter");
};

/**
 * Experimental new matrix class, with templated dimensions - meaning that this can
 * represent rectangular matrices, as well as column and row vectors as needed.
 * We apply unrolling over template arguments W and H by means of lambda expressions
 * and the recursive Unroller templates.
 * NOTE: matrix stride S must be a power of 2, as we use the warp shuffle intrinsics
 *       where S is passed as width to the __shfl function, supporting only powers of 2.
 * NOTE: matrix dimensions template arguments were changed from <int W, int H, ...>
 *       to <int H, int W, ...> to match common conventions. There is no power of 2
 *       restriction on the dimensions. W must be smaller than or equal to S.
 */
template <int H, int W, int S> class newmatrix
{
public:
    // Constructor etc. was omitted. Below follows the multiplication operator.

    /**
     * Fills the matrix object with zeros.
     */
    M_DEVICE void clear() {
        is_power_of_two<S> test;
        auto clearer = [&] (int r) { Row[r] = singlecomplex::zero(); };
        Unroller<0, H>::step( clearer );
    }

    /**
     * Operator: Multiply a HxW matrix with a WxW2 matrix.
     * NOTE: the resulting matrix is of dimension <H,W2>
     */
    template<int W2> M_DEVICE newmatrix<H,W2,S> operator*(const newmatrix<W,W2,S> REF(b)) const {
        is_power_of_two<S> test;
        newmatrix<H,W2,S> result; // HxW * WxW2 = HxW2
        result.clear();           // start the accumulation from zero

        auto outer = [&] (int c) {
            auto inner = [&] (int r) {
                result.Row[r] = result.Row[r] + __shfl(Row[r], c, S) * b.Row[c];
            };
            Unroller<0, H>::step( inner );
        };
        Unroller<0, W>::step( outer );

        return result;
    }

    singlecomplex Row[H];  // uses row indexing 0...H-1, columns distributed across W threads
                           // several matrices within the same warp are aligned with stride S
};
OK, I see now why shuffles could be more efficient in this case. But I think there’s still a chance you could design a carefully crafted shared memory layout to avoid many of the bank conflicts. Or maybe not.

As I understand it, Strassen’s algorithm is the fastest for multiplying two matrices, with a time complexity of O(n^2.8) instead of the O(n^3) of the regular three-loop matrix multiplication algorithm.
Source :