I am devising a kernel that requires vector rotation. More specifically, I am looking for an efficient way to do the following (simplified senario):

Each thread has the same copy of a vector:

v0, v1, v2, …, v63

A rotation rotates these registers within the thread:

v63, v0, v1, …, v62

Let’s say variables a0 to a63 hold v0 to v63. Then

a0 = v0, a1 = v1, …, a63 = v63

becomes

a0 = v63, a1 = v0, …, a63 = v62.

So the algorithm works as follows:

(Each thread has the same vector but different data)

- Read in v0 to v63 as a0 to a63.
- Do some calculation with existing data.
- Rotate a, then return to step 2.

There will be 63 rotations, because the 64th rotation is just the original vector. I don’t think unrolling the loop for 64 times is a good choice. The main calculation is non-trivial and the resulting code would be too long.

This pseudocode shows how I currently get this done, through shared memory:

```
// suppose blockDim = 64
__shared__ float v[64];
// coalesced read from global memory
v[tid] = vec[tid];
__syncthreads();
float a[64];
for rot = 0 to 63:
// broadcast from shared memory
for i = 0 to 63:
a[i] = v[i];
// do some really complicated operations with data and a
// here += is only for demostration
for i = 0 to 63:
data[i] += a[i];
__syncthreads();
// each block reads in v with an offset, no bank conflict
int index = (tid + 63) % 64;
float x = v[index];
__syncthreads();
// then writes back
v[tid] = x;
__syncthreads();
```

Question: Is it possible to rotate more efficiently?

–

Actually, to reduce register pressure, a vector can even be split to 2 or even more threads.

For example, two neighboring threads have the same copy of a vector (v0…v63):

thread 0: v0 v1 v2 … v31

thread 1: v32 v33 … v63

thread 2: v0 v1 v2 … v31

thread 3: v32 v33 … v63

Rotate registers among two threads:

thread 0: v63 v0 … v30

thread 1: v31 v32 … v62

thread 2: v63 v0 … v30

thread 3: v31 v32 … v62

And in my application, the vector dimension is around 128, and the rotation is a varied one. I use Maxwell cards.