smem bank conflicts

Hi,

I have the following piece of code, which shows some warp serialize events (smem bank conflicts) in the profiler, and I could not find the reason. The block size is 16x16 threads.

__shared__ float s_r[1024];

…The kernel is quite large. I use the s_r array mainly for floats, but at some point I need it for 8*256 values, which fit in shorts…

float t1sum=((unsigned short *)(&s_r[tid]))[0]
+((unsigned short *)(&s_r[tid+256]))[0]
+((unsigned short *)(&s_r[tid+2*256]))[0]
+((unsigned short *)(&s_r[tid]+3*256))[0];

Can someone help with a hint? Why does it generate bank conflicts?

Regards,

As far as I remember, the stride for shared mem access must be odd (the reason is explained in the programming guide).

Try allocating more memory and using 257 instead of 256.
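
For what it's worth, the odd-stride rule of thumb can be checked with a small host-side model. This is only a sketch under assumptions that are not in the post above: 16 banks of 32-bit words (as on compute capability 1.x devices; later GPUs have 32), and an access pattern where thread t of a half-warp touches word t * stride. The names gcd_int and conflict_degree are made up for illustration.

```cpp
#include <cassert>

// Greatest common divisor by Euclid's algorithm.
int gcd_int(int a, int b) {
    while (b != 0) {
        int r = a % b;
        a = b;
        b = r;
    }
    return a;
}

// ASSUMED pattern: thread t (t = 0 .. num_banks-1) accesses 32-bit word
// t * stride_words, and word w lives in bank w % num_banks.
// Under that model every bank that is hit at all is hit by exactly
// gcd(stride_words, num_banks) threads, so this is the conflict degree
// (1 means conflict-free).
int conflict_degree(int stride_words, int num_banks) {
    assert(stride_words > 0 && num_banks > 0);
    return gcd_int(stride_words, num_banks);
}
```

Under these assumptions conflict_degree(256, 16) comes out as 16 and conflict_degree(257, 16) as 1, which is why an odd stride helps when the number of banks is a power of two. It only matters if threads of the same half-warp really are separated by that stride.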

As far as I know, two successive shorts in memory are located in the same memory bank. Two successive ints however reside in different banks. In your code there will be two way bank conflicts among pairs of neighbor threads. A solution would be a different interleaving scheme for your shorts.

The other problem, as already noted, is the 256 array stride, resulting in each memory row starting at the same bank. Going to a stride of 258 would improve things.

Christian
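
The claim about neighboring shorts sharing a bank can be sketched with a host-side model. Assumptions that are mine and not from the post: banks are 32 bits wide, successive 32-bit words map to successive banks, there are 16 banks, and thread t reads element t of a tightly packed array of shorts. The function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// ASSUMED bank model: 4-byte-wide banks, word w lives in bank w % num_banks.
int bank_of_byte(int byte_offset, int num_banks) {
    assert(num_banks > 0 && byte_offset >= 0);
    return (byte_offset / 4) % num_banks;
}

// Largest number of threads that land on a single bank when thread t reads
// the 16-bit element at index t of a tightly packed array of shorts.
// Note: this counts any two threads on the same bank, even when they read
// the same 32-bit word; whether the hardware serializes that case depends
// on the GPU generation.
int worst_case_packed_shorts(int num_threads, int num_banks) {
    std::vector<int> hits(num_banks, 0);
    int worst = 0;
    for (int t = 0; t < num_threads; ++t) {
        int bank = bank_of_byte(t * 2, num_banks);  // 2 bytes per short
        hits[bank] += 1;
        if (hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}
```

For a half-warp of 16 threads and 16 banks this model reports 2 threads per bank (threads 0 and 1 on bank 0, threads 2 and 3 on bank 1, and so on), but only for shorts that are packed two per word.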

Bank accesses are done in multiples of 32 bits; since shorts are 16 bits, the multiprocessor still loads 32 bits. In your example, thread #0 conflicts with thread #1, thread #2 with #3, etc., for each short loaded in the summation. I’d just leave them as 4-byte values and eat 8k of shared memory. See slide #4 of this presentation: http://courses.ece.uiuc.edu/ece498/al1/lec…fall%202007.ppt

Alternatively, if eating the 8k of shared memory isn’t an option, you could do this:

float t1sum = (((uint *) (&s_r[tid]))[0] & 0xffff)
            + (((uint *) (&s_r[tid]))[0] >> 16)
            + (((uint *) (&s_r[tid + 128]))[0] & 0xffff)
            + (((uint *) (&s_r[tid + 128]))[0] >> 16);

Sorry if I missed a parenthesis somewhere, but the point is summing the upper and lower 16 bits of a 32 bit value.

Hope this helps

-Patrick
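
The mask-and-shift idea can be tried on the host without a GPU. This is a minimal sketch, not the kernel code itself; sum_packed_halves and pack_halves are hypothetical helper names, and it assumes two unsigned 16-bit values are stored in one 32-bit word with the first value in the low half.

```cpp
#include <cassert>
#include <cstdint>

// Pack two 16-bit values into one 32-bit word: `low` in bits 0..15,
// `high` in bits 16..31 (ASSUMED layout).
std::uint32_t pack_halves(std::uint32_t low, std::uint32_t high) {
    assert(low <= 0xffffu && high <= 0xffffu);
    return (high << 16) | low;
}

// Sum the lower and upper 16 bits of a 32-bit word. The whole word is
// read once, so each thread issues a single 32-bit access.
std::uint32_t sum_packed_halves(std::uint32_t word) {
    return (word & 0xffffu) + (word >> 16);
}
```

For example, sum_packed_halves(pack_halves(3, 5)) gives 8. The result can need 17 bits, so it should not be stored back into a short.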

You didn’t read his code carefully. His stride is 4 bytes between threads, not 2. Notice how he’s indexing a float array and reinterpreting it as shorts. Likewise, stride between successive loads is irrelevant.
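
The 4-byte stride between threads can be checked with a small host-side model of the original four loads. Assumptions that are mine: 16 banks of 32-bit words, word w in bank w % 16, and a half-warp made of 16 consecutive tid values. The helper names are hypothetical.

```cpp
#include <cassert>
#include <set>

// Bank touched by thread `tid` for the load
//   ((unsigned short *)(&s_r[tid + k * 256]))[0]
// The cast only selects 16 bits inside the float's 32-bit word, so the
// bank is decided by the float index alone (ASSUMED bank model).
int bank_for_load(int tid, int k, int num_banks) {
    assert(num_banks > 0 && tid >= 0 && k >= 0);
    int word_index = tid + k * 256;  // index into the float array
    return word_index % num_banks;
}

// True when threads first_tid .. first_tid + group - 1 all hit
// different banks for load number k.
bool load_is_conflict_free(int first_tid, int group, int k, int num_banks) {
    std::set<int> banks;
    for (int t = first_tid; t < first_tid + group; ++t) {
        banks.insert(bank_for_load(t, k, num_banks));
    }
    return static_cast<int>(banks.size()) == group;
}
```

In this model every one of the four loads (k = 0..3) is conflict-free for any half-warp, because the k * 256 offset shifts all threads of the half-warp by the same amount and each load is a separate instruction.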

Serializations are not caused just by bank conflicts. Other causes are non-uniform constant memory accesses and atomics. (Anyone know of any others?)