CUDA Shuffle Instruction (Warp-level intra register exchange)

I am trying to design an efficient matrix transpose by leveraging the CUDA 3.0+ shuffle instruction feature. The CUDA C Programming Guide lists that shuffle should be used as follows.

int __shfl(int var, int srcLane, int width=warpSize);

There is no indication whether “var” can be applied to a per-thread static array.

Suppose I have four threads.

  • I want thread 0 to read src_registers[0] of thread 0
  • I want thread 1 to read src_registers[1] of thread 0
  • I want thread 2 to read src_registers[2] of thread 0
  • I want thread 3 to read src_registers[3] of thread 0

The following code instantiates the above pseudocode.

int WIDTH = 4
    int src_registers[4];
    int dst_registers[4];

    int tid = threadIdx.x % 4;
    dst_registers[0] = __shfl(src_registers[tid], 0, WIDTH);

However, it does not work as expected. dst_registers for each thread returns only one value of the broadcasting thread. Does shuffle stipulate that the “var” reference must be the same variable reference for all threads (e.g. threads can only exchange at src_registers[i] per shuffle instruction)?

Code in the context of a transpose kernel is shown below. Note: this kernel does not work as intended as I don’t think shuffle is meant to be used this way.

#define WIDTH 4
void __global__ shuffle_test(int *input_ary, int *output_ary)
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;
    int src_registers[4];
    int dst_registers[4];

    input_ary += idx;
    output_ary += idx;
    for (int i = 0; i < 4; ++i)
        src_registers[i] = input_ary[stride * i];

    int tid = threadIdx.x % 4;
    dst_registers[0] = __shfl(src_registers[tid], 0, WIDTH);
    dst_registers[1] = __shfl(src_registers[tid], 1, WIDTH);
    dst_registers[2] = __shfl(src_registers[tid], 2, WIDTH);
    dst_registers[3] = __shfl(src_registers[tid], 3, WIDTH);

    for (int i = 0; i < 4; ++i)
        output_ary[stride * i] = dst_registers[i];

One issue is that no opcode supports true dynamic “indexing” of registers within a lane. It would be great if that feature existed.

Because of this, the src_registers[0-3] will be stored to “local” memory (which is indexable) and then read once to obtain src_registers[tid]. That won’t be very performant.

It’s probably not doing what you had hoped.

Over on another thread in this forum I describe a transpose-using-SHFL approach.

There are quite a few ways to avoid using shared memory but… it’s tough to beat the performance of the shared approach.

Here is a list of low-level features that you can mix-and-match to build an alternative transpose kernel:

  • writing 16-byte v2/v4 types to simplify inter-lane word movement [*] explicit shuffling [*] rotations via shuffling [*] implicit shuffling on store [*] implicit shuffling on load [*] maximizing use of SELP: d = pred ? a : b; [*] thinking in terms of the non-existent opcode MOVP: if (p) a = c; else b = c; [*] exchange operations -- swap two registers: XCHG: t = a; a = b; b = t; [*] etc.
  • I’ve actually tried quite a few combinations on the K20c but still haven’t beaten the throughput of the ultra-simple shared approach (yet). :)

    Interestingly if you compile that kernel with “-Xptxas=-v,-abi=no” it will avoid generating the local stores+load and instead will assume tid=0 and explicitly test for each remaining case: tid=1, tid=2 and tid=3 and if there is a match move the proper value into the destination register.

    Unfortunately, ptxas incorrectly reports that 16 bytes of local mem are being used despite no STL/LDL ops.

    Hi Allan,

    I’m fine if the transpose kernel listed in the original post is translated to local memory. I’m interested in functional correctness first (and then performance). Even if the shuffle instruction is spilled into local memory, shouldn’t the program proper index the src_registers[tid] array?

    The behavior when I run this kernel is that it defaults to src_registers[k] for all threads (which is the unintended behavior).

    Yes, the local memory src_registers[tid] accesses will be entirely correct. Just noting that they’re dynamically selecting a value and therefore will be slow.

    But SHFL(src,0,4) will result in all 32 4 lanes fetching the val from each sub-warp’s lane 0. I don’t think that’s what you want.

    Also, I don’t understand the intent of the stride calculation and am probably missing what’s supposed to happen in the SHFL sequence… so I’m probably just making things more confusing.

    Hi allanmac,

    in your blog postings you describe two approaches for transposing tiles without using shared memory. I was wondering could also be making some related code available. I am quite new to the shfl() business, and some example code could really help me.

    What I am trying is to get rid of shared memory in cudaminer - I’ve found out my memory operations are limiting the performance on my GTX 780 Ti. So I want to redesign this part completely.

    EDIT: I did just find your repo here, but it does not contain anything about matrix transpose.

    I never posted the transpose code on github. The two blog posts describe the best of the large number of approaches I tried. There were 5 different shfl/xchg/slct approaches and 9 different tile “shapes” so it was a lot of code that I would’ve had to clean up and post. :)

    Regarding getting rid of shared memory, a good starting point is thinking about a warp or block’s maximum register set as one large slab of memory that is very fast but can’t be easily indexed. On sm_3x, there are 64K 32-bit registers and only 12K 32-bit shared words. That’s where shuffling, finding symmetries and other techniques become useful because they can let you minimize shared memory use and tap into the relatively large register set.

    Hi Allan,

    I tried to follow the procedure described in your blog posting here

    The code seems to work (I tried it with two logical tiles in a thread block). It wasn’t so hard after all!

    __device__ uint4 __shfl(uint4 val, unsigned int lane)
        return make_uint4(
            __shfl((int)val.x, lane),
            __shfl((int)val.y, lane),
            __shfl((int)val.z, lane),
            __shfl((int)val.w, lane));
    __device__ void __swap(uint4 &a, uint4 &b)
        uint4 t = b; b = a; a = t;
    __global__ void testKernel(uint4 *c)
        uint4 B[4]; // 16 registers
        int lane4 = laneId%4;
        int base = laneId/4;
        // rotate
        B[1] = __shfl(B[1], base*4+(lane4+3)%4);
        B[2] = __shfl(B[2], base*4+(lane4+2)%4);
        B[3] = __shfl(B[3], base*4+(lane4+1)%4);
        // exchange
        if (lane4 >= 2) { __swap(B[0], B[2]); __swap(B[1], B[3]); }
        // select + write
        c[16*base+    lane4     ] = (laneId % 2 == 0) ? B[0] : B[1];
        c[16*base+4 +(lane4+3)%4] = (laneId % 2 == 0) ? B[3] : B[0];
        c[16*base+8 +(lane4+2)%4] = (laneId % 2 == 0) ? B[2] : B[3];
        c[16*base+12+(lane4+1)%4] = (laneId % 2 == 0) ? B[1] : B[2];

    Thank you for the enlightening blog postings. Keep 'em coming!

    Now I have to adapt the code to be useful in my scrypt hashing implementation. I will need a corresponding global memory read function as well.

    Cool! I didn’t use uint4’s but after you asked about the transposition code and after @njuffa mentioned his use of 256-bit u64x4’s I started thinking it would be cleaner to use vector types… and your code looks much cleaner than mine.

    One other trick is to coax SHFL into “rotating” the entire warp of registers without using an explicit REM or AND op. I wrote about that here.