The problem with a pure shuffle approach is that each thread would have to mix different values of i. So either the value fed to the shuffle instruction or the store into the dst array would need a dynamic index, and re-sorting 8 values takes quite a lot of select instructions (roughly 8² = 64, regardless of whether you do it with a switch-case or a tree-like swap, and both for modulo-added and for XOR indices).
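To see the select cost concretely: writing one value to a dynamic index of a register array cannot be a real store, so the compiler turns it into one select per slot, and a full re-sort of 8 values costs about 8 * 8 of them. A minimal sketch (scatter_to is a hypothetical helper, not part of the code below):

#include <cuda_fp16.h>

__device__ __forceinline__ void scatter_to(half2 dst[8], half2 val, int idx) {
    // dst is meant to live in registers, so dst[idx] with a runtime idx
    // cannot be addressed directly; the unrolled loop becomes 8 selects.
    #pragma unroll
    for (int j = 0; j < 8; j++)
        dst[j] = (idx == j) ? val : dst[j];
}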
If you instead make i the outer array dimension and take the simplest approach, you get random bank conflicts (instead of a guaranteed 8-way conflict). The average conflict factor is around 3.42x (even when only distinct indices are considered). So the 3x approach is still a bit faster, with shared memory bandwidth being the limiting factor.
#include <cuda_fp16.h>

#define FULL_MASK 0xffffffffu

__global__ void f(int* ptr1, int* ptr2, ...) {
    __shared__ half src1[64][8][8];
    __shared__ half src2[64][8][8];
    half2 dst1[8];
    half2 dst2[8];
    half temp1a[8];
    half temp1b[8];
    half temp2a[8];
    half temp2b[8];
    int lane_id = threadIdx.x;
    int group_id = lane_id / 4;
    int tid_in_group = lane_id % 4;
    // bits 1-2 of lane_id pick one of 4 pointers, bits 0 and 3-4 pick the row 0..7
    load_matrix_x4(&temp1a[0], &src1[ptr1[(lane_id & 0x6) >> 1]][(lane_id & 1) + ((lane_id & 0x18) >> 2)][0]);
    load_matrix_x4(&temp2a[0], &src2[ptr2[(lane_id & 0x6) >> 1]][(lane_id & 1) + ((lane_id & 0x18) >> 2)][0]);
    load_matrix_x4(&temp1b[0], &src1[ptr1[4 + ((lane_id & 0x6) >> 1)]][(lane_id & 1) + ((lane_id & 0x18) >> 2)][0]);
    load_matrix_x4(&temp2b[0], &src2[ptr2[4 + ((lane_id & 0x6) >> 1)]][(lane_id & 1) + ((lane_id & 0x18) >> 2)][0]);
    for (int i = 0; i < 8; i++) {
        // pair element i of both operations so one 32-bit shuffle moves both at once
        half2 tempa{ temp1a[i], temp2a[i] };
        half2 tempb{ temp1b[i], temp2b[i] };
        half2 sendA = (lane_id & 4) ? tempb : tempa;
        half2 sendB = (lane_id & 4) ? tempa : tempb;
        int src_lane_a = ((group_id & 4) ? 4 : 0) + 8 * tid_in_group + (group_id & 3);
        int src_lane_b = ((group_id & 4) ? 0 : 4) + 8 * tid_in_group + (group_id & 3);
        // shuffle each half2 as a 32-bit value
        unsigned rcvA = __shfl_sync(FULL_MASK, reinterpret_cast<unsigned&>(sendA), src_lane_a);
        unsigned rcvB = __shfl_sync(FULL_MASK, reinterpret_cast<unsigned&>(sendB), src_lane_b);
        half2 recvA = reinterpret_cast<half2&>(rcvA);
        half2 recvB = reinterpret_cast<half2&>(rcvB);
        // unzip: x components belong to operation 1, y components to operation 2
        half2 recvX = half2{recvA.x, recvB.x};
        half2 recvY = half2{recvA.y, recvB.y};
        dst1[i] = (group_id & 4) ? recvY : recvX;
        dst2[i] = (group_id & 4) ? recvX : recvY;
    }
}
This is how ldmatrix and shuffle can be combined for only 2x the shared memory bandwidth.
To make it work, the warp has to prepare the data for 2 matrix operations at the same time (potentially with different ptr and different src), which should not be a huge disadvantage.
In total that is 4 * 4 = 16 ldmatrix loads and 8 * 2 = 16 shuffles, i.e. 32 operations going through the shared memory pipeline, or 16 per matrix preparation.
(code not tested)
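For completeness, load_matrix_x4 above stands for a thin wrapper around the ldmatrix.x4 PTX instruction; a minimal sketch, assuming sm_75 or newer and a src pointer into shared memory:

#include <cuda_fp16.h>

__device__ void load_matrix_x4(half* dst, const half* src) {
    // ldmatrix.x4 delivers 4 x 32 bits per thread, i.e. the 8 halves of dst
    unsigned* r = reinterpret_cast<unsigned*>(dst);
    unsigned addr = static_cast<unsigned>(__cvta_generic_to_shared(src));
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];"
        : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3])
        : "r"(addr));
}

Each lane passes the address of one 8-half row; which (matrix, row) a lane has to point at is exactly the bit decomposition used in the four load calls above.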
Thanks! I will try it.