Shared memory bank conflict


After … a long time, I boiled a performance issue down to the simplified code below. It’s most probably due to shared memory bank conflicts. I read some NVIDIA doc pages (again and again), but I don’t really see how I can avoid the bank conflicts, since all the solutions I saw use strided/offset methods that involve the thread index. In my case, the x[] array is populated sequentially inside the kernel, and the calculations on the entire array are also done inside the kernel.

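#include <cstdint>

__global__ void k() {
	__shared__ uint64_t x[32];	// __shared__ variables cannot be initialized in their declaration
	for (int i = 0; i < 32; i++)
		x[i] = i;	// every thread writes the same value to the same element
	// ... (rest of the kernel omitted)
}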

The performance loss is about a factor of 2.5. I need to get rid of this. A hint would be sufficient.

Many thanks!

Your “boiling down” process has probably removed important info.

Anyway, bank conflicts can only be assessed in light of the behavior of all the threads in a warp, not of a single thread.

Your code as shown here (the for-loop, at least) would not demonstrate any bank conflicts. Even if we consider multiple threads in a warp, there is no reason to presume bank conflicts for this code. When all threads in a warp access the same shared memory location in a given cycle, that is not a bank conflict.
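To make the point above concrete, here is a minimal host-side sketch of how a warp-wide access can be checked for bank conflicts. It is only a simplified model, not how the hardware is implemented: it assumes 32 banks that are each one 32-bit word wide, with successive words mapped to successive banks, and it ignores the special handling of 64-bit accesses such as the uint64_t array in the question. The names `conflict_degree` and `strided` are made up for this illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <set>

// Assumed model: 32 banks, each one 32-bit word wide, and a warp of 32 threads.
constexpr int kNumBanks = 32;
constexpr int kWarpSize = 32;

// One entry per lane: the (non-negative) 32-bit word index that lane accesses.
using WarpAccess = std::array<int, kWarpSize>;

// Returns the largest number of DISTINCT words that land in the same bank.
// 1 means conflict-free; N means the access is serialized N ways in this model.
// Lanes that touch the same word are counted once, so they do not conflict.
int conflict_degree(const WarpAccess& word_index) {
    std::array<std::set<int>, kNumBanks> words_in_bank;
    for (int w : word_index) {
        words_in_bank[w % kNumBanks].insert(w);
    }
    std::size_t worst = 0;
    for (const auto& words : words_in_bank) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);
}

// Hypothetical helper: lane L accesses word L * stride.
WarpAccess strided(int stride) {
    WarpAccess a{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        a[lane] = lane * stride;
    }
    return a;
}
```

In this model, `strided(0)` (every lane on the same word, as in the for-loop above) and `strided(1)` both give a degree of 1, `strided(2)` gives 2, `strided(32)` gives 32, and `strided(33)` is back to 1, which is why padding a row by one word is a common way to remove conflicts.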

I meant “bank conflict” regarding the block (= multiple warps).

Problem solved. I checked again (and again) and found the issue (indeed in the rest of the code, which is too lengthy to post). The shared x[] array was being written by all the threads simultaneously. This was a stupid coding error on my side, not even a missing __syncthreads(). Needless to say, this generated conflicts which reduced performance dramatically! Thanks for the kick …