I am devising a kernel that requires vector rotation. More specifically, I am looking for an efficient way to do the following (simplified senario):
Each thread has the same copy of a vector:
v0, v1, v2, …, v63
A rotation rotates these registers within the thread:
v63, v0, v1, …, v62
Let’s say variables a0 to a63 hold v0 to v63. Then
a0 = v0, a1 = v1, …, a63 = v63
a0 = v63, a1 = v0, …, a63 = v62.
So the algorithm works as follows:
(Each thread has the same vector but different data)
- Read in v0 to v63 as a0 to a63.
- Do some calculation with existing data.
- Rotate a, then return to step 2.
There will be 63 rotations, because the 64th rotation is just the original vector. I don’t think unrolling the loop for 64 times is a good choice. The main calculation is non-trivial and the resulting code would be too long.
This pseudocode shows how I currently get this done, through shared memory:
// suppose blockDim = 64 __shared__ float v; // coalesced read from global memory v[tid] = vec[tid]; __syncthreads(); float a; for rot = 0 to 63: // broadcast from shared memory for i = 0 to 63: a[i] = v[i]; // do some really complicated operations with data and a // here += is only for demostration for i = 0 to 63: data[i] += a[i]; __syncthreads(); // each block reads in v with an offset, no bank conflict int index = (tid + 63) % 64; float x = v[index]; __syncthreads(); // then writes back v[tid] = x; __syncthreads();
Question: Is it possible to rotate more efficiently?
Actually, to reduce register pressure, a vector can even be split to 2 or even more threads.
For example, two neighboring threads have the same copy of a vector (v0…v63):
thread 0: v0 v1 v2 … v31
thread 1: v32 v33 … v63
thread 2: v0 v1 v2 … v31
thread 3: v32 v33 … v63
Rotate registers among two threads:
thread 0: v63 v0 … v30
thread 1: v31 v32 … v62
thread 2: v63 v0 … v30
thread 3: v31 v32 … v62
And in my application, the vector dimension is around 128, and the rotation is a varied one. I use Maxwell cards.