Hello,
I’m trying to optimize a kernel I wrote for the linear transformation of the Serpent encryption algorithm. All my memory reads are coalesced, but I had a lot of bank conflicts, with too many threads accessing the same shared-memory bank.
I decided to increase the shared-memory usage and leave one spare word (32 bits) unused per thread, just so the threads won’t access the same shared-memory banks. That did solve the conflicts, and the kernel now runs a bit faster, but it also had a negative side effect, one I can’t explain.
The GPU’s occupancy dropped to 67%. Nothing changed in terms of register count (10 registers/thread), and the occupancy calculator says it should be 100%. If anyone out there has any idea, please let me know. Also, if anyone has a better way to solve my bank-conflict issue, it would be much appreciated.
– Liad Weinberger.
Using GeForce 8600GT (CC = 1.1 afaik)
Threads per block: 256
Shared memory size: 5120 Bytes
Registers per thread: 10
The kernel, after fixing the bank conflict issue:
[codebox]__global__ void sLTransKernel(unsigned int *d_data, unsigned int sz)
{
    extern __shared__ unsigned int s_data[];
    unsigned int i, gIdx;

    // First, read 4 input words (32 bits each) into shared memory in a coalesced manner.
    // Thread n reads words 0 + n, 256 + n, 512 + n, and 768 + n.
    // Each 256-word row is stored with a row stride of 320; the threadIdx.x / 4 term
    // inserts one spare padding word per thread.
    for (i = 0; i < 4; i++)
    {
        gIdx = blockIdx.x * blockDim.x * 4 + i * 256 + threadIdx.x;
        if (gIdx < sz)
            s_data[i * 320 + threadIdx.x + (threadIdx.x / 4)] = d_data[gIdx];
    }

    // Synchronize the threads.
    __syncthreads();

    // Now transform 4 words of input (128 bits in total).
    // Thread n transforms words 4n, 4n + 1, 4n + 2, and 4n + 3, which the padded
    // layout places at shared-memory indices 5n through 5n + 3.
    i = threadIdx.x * 5;
    s_data[i]   = (s_data[i] << 13) | (s_data[i] >> 19);          // X0 := X0 <<< 13
    s_data[i+2] = (s_data[i+2] << 3) | (s_data[i+2] >> 29);       // X2 := X2 <<< 3
    s_data[i+1] = s_data[i+1] ^ s_data[i] ^ s_data[i+2];          // X1 := X1 ⊕ X0 ⊕ X2
    s_data[i+3] = s_data[i+3] ^ s_data[i+2] ^ (s_data[i] << 3);   // X3 := X3 ⊕ X2 ⊕ (X0 << 3)
    s_data[i+1] = (s_data[i+1] << 1) | (s_data[i+1] >> 31);       // X1 := X1 <<< 1
    s_data[i+3] = (s_data[i+3] << 7) | (s_data[i+3] >> 25);       // X3 := X3 <<< 7
    s_data[i]   = s_data[i] ^ s_data[i+1] ^ s_data[i+3];          // X0 := X0 ⊕ X1 ⊕ X3
    s_data[i+2] = s_data[i+2] ^ s_data[i+3] ^ (s_data[i+1] << 7); // X2 := X2 ⊕ X3 ⊕ (X1 << 7)
    s_data[i]   = (s_data[i] << 5) | (s_data[i] >> 27);           // X0 := X0 <<< 5
    s_data[i+2] = (s_data[i+2] << 22) | (s_data[i+2] >> 10);      // X2 := X2 <<< 22

    // Synchronize the threads.
    __syncthreads();

    // Now copy the results back to global memory in a coalesced manner.
    // Thread n writes words 0 + n, 256 + n, 512 + n, and 768 + n.
    for (i = 0; i < 4; i++)
    {
        gIdx = blockIdx.x * blockDim.x * 4 + i * 256 + threadIdx.x;
        if (gIdx < sz)
            d_data[gIdx] = s_data[i * 320 + threadIdx.x + (threadIdx.x / 4)];
    }
}[/codebox]