Optimizing bank conflicts - problem with occupancy

Hello,

I’m trying to optimize a kernel I wrote for the linear transformation of the Serpent encryption algorithm. All my memory reads are coalesced, but I had a lot of bank conflicts, with many threads accessing the same shared memory bank.

I decided to increase the shared memory usage and leave a spare word (32 bits) unused per thread of execution, just so that the threads won’t access the same shared memory banks. That did solve the issue, and the kernel now runs a bit faster, but it also had a negative side effect that I can’t explain.

The GPU’s occupancy dropped to 67%. Nothing changed in terms of register count (10 registers/thread), and the occupancy calculator says it should be at 100%. If anyone out there has any idea, please let me know. Also, if anyone has a better idea of how to solve my bank conflict issue, it would be much appreciated.

– Liad Weinberger.


Using GeForce 8600GT (CC = 1.1 afaik)

Threads per block: 256

Shared memory size: 5120 Bytes

Registers per thread: 10


The kernel, after fixing the bank conflict issue:

[codebox]__global__ void sLTransKernel(unsigned int *d_data, unsigned int sz)

{

extern __shared__ unsigned int s_data[];

unsigned int i, gIdx;

// First of all, read 4 input words (32 bits each) into shared memory in a coalesced manner.

// Thread n will read words 0 + n, 256 + n, 512 + n, and 768 + n.

for (i = 0; i < 4; i++)

{

	gIdx = blockIdx.x * blockDim.x * 4 + i * 256 + threadIdx.x;

	if (gIdx < sz)

		s_data[i * 320 + threadIdx.x + (threadIdx.x / 4)] = d_data[gIdx];

}

// Synchronize the threads.

__syncthreads();

// Now, transform the 4 words of input (totaling 128bit).

// Thread n will transform words 4n + 0, 4n + 1, 4n + 2, and 4n + 3.

i = threadIdx.x * 5;

s_data[i] = (s_data[i] << 13) | (s_data[i] >> 19);				// X0 := X0 <<< 13

s_data[i+2] = (s_data[i+2] << 3) | (s_data[i+2] >> 29);			// X2 := X2 <<< 3

s_data[i+1] = s_data[i+1] ^ s_data[i] ^ s_data[i+2];			// X1 := X1 ⊕ X0 ⊕ X2

s_data[i+3] = s_data[i+3] ^ s_data[i+2] ^ (s_data[i] << 3);		// X3 := X3 ⊕ X2 ⊕ (X0 << 3)

s_data[i+1] = (s_data[i+1] << 1) | (s_data[i+1] >> 31);			// X1 := X1 <<< 1

s_data[i+3] = (s_data[i+3] << 7) | (s_data[i+3] >> 25);			// X3 := X3 <<< 7

s_data[i] = s_data[i] ^ s_data[i+1] ^ s_data[i+3];				// X0 := X0 ⊕ X1 ⊕ X3

s_data[i+2] = s_data[i+2] ^ s_data[i+3] ^ (s_data[i+1] << 7);	// X2 := X2 ⊕ X3 ⊕ (X1 << 7)

s_data[i] = (s_data[i] << 5) | (s_data[i] >> 27);				// X0 := X0 <<< 5

s_data[i+2] = (s_data[i+2] << 22) | (s_data[i+2] >> 10);		// X2 := X2 <<< 22

// Synchronize the threads.

__syncthreads();

// Now copy results back to global memory in a coalesced manner.

// Thread n will write words 0 + n, 256 + n, 512 + n, and 768 + n.

for (i = 0; i < 4; i++)

{

	gIdx = blockIdx.x * blockDim.x * 4 + i * 256 + threadIdx.x;

	if (gIdx < sz)

		d_data[gIdx] = s_data[i * 320 + threadIdx.x + (threadIdx.x / 4)];

}

}[/codebox]

I assume 5120 bytes is the shared memory before introducing the extra word per thread?

Then one more word per thread equates to 1024 bytes per block, which means that only two instead of three blocks can run concurrently. That’s an occupancy of 16/24 warps or 67% instead of 24/24 warps (100%).

67% is still a very good occupancy. It is important to have at least 6 warps running (6/24 warps = 25% occupancy) to fully hide the latency of the pipeline (which seems to have 24 stages). Anything above that only serves to hide global memory latency, which matters less if you have only a few global memory accesses.

EDIT: Should have actually read your code - 5120 seems to be the shared memory including the extra word. How do you actually call the kernel in your program?

Hi,

Yes, 5120 is including the extra word per thread. The kernel is executed with:

dim3 blockSz(256);

dim3 gridSz(PADDED_DATA_LENGTH / 256);

sLTransKernel<<<gridSz, blockSz, (256*5)*sizeof(unsigned int)>>>(d_data, PADDED_DATA_LENGTH);

Ok, so something is unexpectedly eating about 1k of shared memory per multiprocessor. :mellow:

(EDIT: had written “per kernel” first, but actually meant “per multiprocessor”. The real value might be somewhere in between)

I think you can save a lot of the unused memory, though, and still avoid bank conflicts:

[codebox]

__global__ void sLTransKernel(unsigned int *d_data, unsigned int sz)

{

__shared__ unsigned int s_data[4][256+4];

unsigned int i, gIdx;

// First of all, read 4 input words (32 bits each) into shared memory in a coalesced manner.

// Thread n will read words 0 + n, 256 + n, 512 + n, and 768 + n.

for (i = 0; i < 4; i++)

{

	gIdx = blockIdx.x * blockDim.x * 4 + i * 256 + threadIdx.x;

	if (gIdx < sz)

		s_data[threadIdx.x & 3][(i * 256 + threadIdx.x) >> 2] = d_data[gIdx];

}

// Synchronize the threads.

__syncthreads();

// Now, transform the 4 words of input (totaling 128bit).

// Thread n will transform words 4n + 0, 4n + 1, 4n + 2, and 4n + 3.

i = threadIdx.x;

s_data[0][i] = (s_data[0][i] << 13) | (s_data[0][i] >> 19);			// X0 := X0 <<< 13

s_data[2][i] = (s_data[2][i] << 3) | (s_data[2][i] >> 29);			// X2 := X2 <<< 3

s_data[1][i] = s_data[1][i] ^ s_data[0][i] ^ s_data[2][i];			// X1 := X1 ⊕ X0 ⊕ X2

s_data[3][i] = s_data[3][i] ^ s_data[2][i] ^ (s_data[0][i] << 3);		// X3 := X3 ⊕ X2 ⊕ (X0 << 3)

s_data[1][i] = (s_data[1][i] << 1) | (s_data[1][i] >> 31);			// X1 := X1 <<< 1

s_data[3][i] = (s_data[3][i] << 7) | (s_data[3][i] >> 25);			// X3 := X3 <<< 7

s_data[0][i] = s_data[0][i] ^ s_data[1][i] ^ s_data[3][i];			// X0 := X0 ⊕ X1 ⊕ X3

s_data[2][i] = s_data[2][i] ^ s_data[3][i] ^ (s_data[1][i] << 7);		// X2 := X2 ⊕ X3 ⊕ (X1 << 7)

s_data[0][i] = (s_data[0][i] << 5) | (s_data[0][i] >> 27);			// X0 := X0 <<< 5

s_data[2][i] = (s_data[2][i] << 22) | (s_data[2][i] >> 10);			// X2 := X2 <<< 22

// Synchronize the threads.

__syncthreads();

// Now copy results back to global memory in a coalesced manner.

// Thread n will write words 0 + n, 256 + n, 512 + n, and 768 + n.

for (i = 0; i < 4; i++)

{

	gIdx = blockIdx.x * blockDim.x * 4 + i * 256 + threadIdx.x;

	if (gIdx < sz)

		d_data[gIdx] = s_data[threadIdx.x & 3][(i * 256 + threadIdx.x) >> 2];

}

}[/codebox]

In principle you could even regain the remaining 64 bytes of padding per block, but that’s probably not worth it.

As a further optimization, it’s probably worth placing X0…X3 into registers (if the compiler does not already do so). Operations on registers are cheaper than on shared memory, particularly when both operands come from shared memory.

The 5120 bytes is just your own allocation (256 * 5 * sizeof(unsigned int)); however, other data is also passed through shared memory, such as the thread and block IDs and the kernel arguments (the pointers in your kernel declaration, etc.).
So if you put 5121 into the occupancy calculator, you’ll indeed get 67% :)

eyal

Wow, that’s an impressive demonstration!

Haven’t actually used the occupancy calculator myself; I usually just do it in my head (leaving ‘a bit’ of spare shared memory). It might be worth looking into it to see how much shared memory is used otherwise.

Looking at your problem from a more global perspective: Why bother going through shared memory at all?

It is both simpler and faster without:

[codebox]

__global__ void sLTransKernel(uint4 *d_data, unsigned int sz)

{

uint4 d;

unsigned int gIdx = blockIdx.x * blockDim.x + threadIdx.x;

if (gIdx < sz) {

	d = d_data[gIdx];

	d.x = (d.x << 13) | (d.x >> 19);		// X0 := X0 <<< 13

	d.z = (d.z << 3) | (d.z >> 29);			// X2 := X2 <<< 3

	d.y = d.y ^ d.x ^ d.z;				// X1 := X1 ⊕ X0 ⊕ X2

	d.w = d.w ^ d.z ^ (d.x << 3);			// X3 := X3 ⊕ X2 ⊕ (X0 << 3)

	d.y = (d.y << 1) | (d.y >> 31);			// X1 := X1 <<< 1

	d.w = (d.w << 7) | (d.w >> 25);			// X3 := X3 <<< 7

	d.x = d.x ^ d.y ^ d.w;				// X0 := X0 ⊕ X1 ⊕ X3

	d.z = d.z ^ d.w ^ (d.y << 7);			// X2 := X2 ⊕ X3 ⊕ (X1 << 7)

	d.x = (d.x << 5) | (d.x >> 27);			// X0 := X0 <<< 5

	d.z = (d.z << 22) | (d.z >> 10);		// X2 := X2 <<< 22

	d_data[gIdx] = d;

}

}

[/codebox]

Hi all, and thanks for the replies.

Eyal, thank you for clearing that up. You are right: I forgot that aside from my dynamically allocated shared memory, there’s also the statically allocated shared memory. In my case it seems to amount to 32 bytes, which does explain the occupancy degradation. Thanks for pointing that out. A friend suggested I go with 192 threads per block instead of 256, thus increasing the number of active blocks to 4 and lowering the shared memory required per block. I’ll give that a try.

Tera, thanks for the tip on using uint4. I’ll give it a try and compare it to the 192-threads-per-block solution mentioned above combined with explicitly putting the data into registers, and to your earlier suggestion (reducing the shared memory usage).

I will post the results back to this thread for your review.

– Liad Weinberger.

Hi all,

Here are some results. I used a 40KB input for the tests.

192tpb with padding of shared memory:

    [*]GPU Time: 53.28 usec

    [*]Shared mem: 3872 Bytes

    [*]Registers per thread: 10

    [*]Occupancy: 100%

    [*]Warp serialize: 1737

    [*]GST Coalesced: 5184

256tpb with padding of shared memory:

    [*]GPU Time: 54.016 usec (+1.38%)

    [*]Shared mem: 5152 Bytes

    [*]Registers per thread: 10

    [*]Occupancy: 67%

    [*]Warp serialize: 1823

    [*]GST Coalesced: 5120

192tpb with padding of shared memory, and explicit registers:

    [*]GPU Time: 50.56 usec (-5.11%)

    [*]Shared mem: 3872 Bytes

    [*]Registers per thread: 9

    [*]Occupancy: 100%

    [*]Warp serialize: 1851

    [*]GST Coalesced: 5184

256tpb no shared memory (tera’s latest suggestion):

    [*]GPU Time: 38.816 usec (-27.15%)

    [*]Shared mem: 32 Bytes

    [*]Registers per thread: 8

    [*]Occupancy: 100%

    [*]Warp serialize: 0

    [*]GST Coalesced: 5120

I did not attempt tera’s previous suggestion for reducing the shared memory usage yet. Will attempt it tomorrow.

Thus far it seems that the best option was tera’s last suggestion!

Best regards,

– Liad Weinberger.

I am surprised. And irritated by the fact that my latest suggestion produces four times as many coalesced accesses. Could you by any chance have run it on a dataset four times as large, as it now operates on uint4s?

You are right!

So sorry, I forgot to divide the grid size and word count by 4. Fixed the stats, check it out! Amazing job! Thank you so much!

Hi all,

I am humbled by my own stupidity. Sorry, it seems I made a really dumb mistake in the kernel launches of all but the fixed run of tera’s final suggestion. After posting the correction I set off to work, and on the way I realized I was launching WAY TOO MANY threads in every kernel execution.

Since each thread handles 4 words of 32 bits, the grid size should be (PADDED_DATA_LENGTH / 4) / 256 and not PADDED_DATA_LENGTH / 256, as I wrote it. This should decrease the GPU time of all the runs dramatically, and has the added benefit of not overflowing the shared memory (funny how that had no effect on the card’s performance).

Also, as mentioned above, in tera’s final suggestion, I forgot to pass a revised value to the sz parameter of the kernel. The current run results for that suggestion includes a correction for both issues. I will fix the grid size issue in the other runs as well and post the correct results later today (my evening).

Again thank you for the replies and aid, and interest in this issue, and sorry for not double checking my logic before posting the results.

– Liad Weinberger.

Hi,

Added the fixed run benchmarks. This clearly shows that reducing the number of global memory transactions by reading 128-bit words per thread, as in tera’s final suggestion, benefits performance much more than using shared memory and reading 32-bit words per thread.

I also attempted another shared-memory-based solution, which was supposed to eliminate bank conflicts entirely, but it didn’t prove any better than the 192-thread basis of comparison, so I won’t post its results.

I will probably start a new thread sometime next week, detailing the optimization concerns for a different kernel, responsible for the key mixing and s-box operation of the serpent encryption algorithm.

Thank you to all who contributed to this thread, and especially to tera for providing a winning solution. This has been highly educational for me.

– Liad Weinberger.