smem bank conflicts

bog · September 23, 2008, 4:32pm

Hi,

I have the following piece of code, which the profiler reports as causing warp serialization (shared memory bank conflicts). I could not find the reason. The block size is 16x16 threads.

__shared__ float s_r[1024];

…The kernel is quite large. I use s_r array for floats mainly, but at some point I need it for 8*256 data, which can fit as shorts…

float t1sum=((unsigned short *)(&s_r[tid]))[0]
+((unsigned short *)(&s_r[tid+256]))[0]
+((unsigned short *)(&s_r[tid+2*256]))[0]
+((unsigned short *)(&s_r[tid]+3*256))[0];

Can someone help with a hint? Why does it generate bank conflicts?

Regards,

Romant · September 23, 2008, 7:14pm

As far as I remember, the stride for shared mem access must be odd (the reason is explained in the programming guide).

Try allocating more memory and use 257 instead of 256.
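The odd-stride rule matters when the stride separates the words read by neighboring threads. A minimal host-side sketch of why, assuming the compute capability 1.x layout (16 banks of 32 bits, conflicts evaluated per half-warp of 16 threads); `conflict_degree` is a hypothetical helper for illustration, not a CUDA API:

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Assumed model (compute capability 1.x): 16 banks, each 32 bits wide,
// successive 32-bit words map to successive banks, and conflicts are
// evaluated per half-warp of 16 threads.
constexpr int kBanks = 16;
constexpr int kHalfWarp = 16;

// Worst-case number of threads in one half-warp that hit the same bank
// when thread t reads the 32-bit word at index t * stride_words.
// 1 means conflict-free; 16 means the half-warp is fully serialized.
int conflict_degree(int stride_words) {
    std::array<int, kBanks> hits{};  // zero-initialized per-bank counters
    for (int t = 0; t < kHalfWarp; ++t) {
        ++hits[(t * stride_words) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

In this model `conflict_degree(256)` is 16 and `conflict_degree(257)` is 1, which is the effect padding a row from 256 to 257 words is meant to achieve. The model counts threads per bank only and does not account for the broadcast case where all threads read the same word.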

cbuchner1 · September 23, 2008, 7:18pm

As far as I know, two successive shorts in memory are located in the same memory bank. Two successive ints however reside in different banks. In your code there will be two way bank conflicts among pairs of neighbor threads. A solution would be a different interleaving scheme for your shorts.

The other problem, as already noted, is the array stride of 256, which results in each memory row starting at the same bank. Going to a stride of 258 would improve things.

Christian

pstach · September 27, 2008, 1:42am

Bank accesses are done in multiples of 32 bits; although shorts are 16 bits, the multiprocessor still loads 32 bits. In your example, thread #0 conflicts with thread #1, thread #2 with #3, etc., at each short loaded in the summation. I’d just leave them as 4-byte values and eat 8k of shared memory. See slide #4 of this presentation: http://courses.ece.uiuc.edu/ece498/al1/lec…fall%202007.ppt
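To make the "32 bits per bank" point concrete, here is a rough host-side sketch of which bank each thread of a half-warp touches when thread t reads element t of a tightly packed array. It assumes the compute capability 1.x layout (16 banks, 32 bits each); `bank_of` and `packed_conflict_degree` are made-up names for illustration:

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kBanks = 16;     // assumption: compute capability 1.x
constexpr int kHalfWarp = 16;  // conflicts are evaluated per half-warp

// Bank that holds a given byte address: each bank entry is one 32-bit word.
int bank_of(int byte_addr) { return (byte_addr / 4) % kBanks; }

// Thread t reads element t of a tightly packed array whose elements are
// elem_bytes wide. Returns the largest number of threads sharing one bank.
int packed_conflict_degree(int elem_bytes) {
    std::array<int, kBanks> hits{};  // zero-initialized per-bank counters
    for (int t = 0; t < kHalfWarp; ++t) {
        ++hits[bank_of(t * elem_bytes)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions packed shorts (`elem_bytes == 2`) put threads 0 and 1 in bank 0, threads 2 and 3 in bank 1, and so on, giving a degree of 2, while 4-byte elements give a degree of 1. Whether this applies to a particular kernel depends on the byte distance between the addresses neighboring threads actually read.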

Alternatively, if eating the 8k of shared memory isn’t an option, you could do this:

float t1sum = (((uint *) (&s_r[tid]))[0] & 0xffff)
            + (((uint *) (&s_r[tid]))[0] >> 16)
            + (((uint *) (&s_r[tid + 128]))[0] & 0xffff)
            + (((uint *) (&s_r[tid + 128]))[0] >> 16);

Sorry if I missed a parenthesis somewhere, but the point is summing the upper and lower 16 bits of a 32-bit value.
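The bit manipulation on its own can be checked on the host. A small sketch, where `pack_halves` and `sum_halves` are hypothetical helper names and the layout (first short in the low 16 bits, second in the high 16 bits) is an assumption about how the data would be packed:

```cpp
#include <cassert>
#include <cstdint>

// Pack two 16-bit values into one 32-bit word:
// `lo` goes into bits 0-15, `hi` into bits 16-31 (assumed layout).
std::uint32_t pack_halves(std::uint16_t lo, std::uint16_t hi) {
    return static_cast<std::uint32_t>(lo) |
           (static_cast<std::uint32_t>(hi) << 16);
}

// Sum of the lower and upper 16 bits of a 32-bit word, using the same
// mask-and-shift as the expression above.
std::uint32_t sum_halves(std::uint32_t word) {
    return (word & 0xffffu) + (word >> 16);
}
```

For example, `sum_halves(pack_halves(3, 5))` evaluates to 8. The word has to be read as an unsigned type, as in the `uint` cast above, so that `>> 16` shifts in zeros rather than sign bits.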

Hope this helps

-Patrick

alex_dubinsky · September 30, 2008, 3:09am

You didn’t read his code carefully. His stride is 4 bytes between threads, not 2. Notice how he’s indexing a float array and reinterpreting it as shorts. Likewise, stride between successive loads is irrelevant.
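This can be checked against a simple bank model. The sketch below assumes the compute capability 1.x layout (16 banks of 32 bits, conflicts evaluated per half-warp) and that `tid` is a linear index, so the 16 threads of a half-warp have consecutive `tid` values; the helper names are made up for illustration:

```cpp
#include <cassert>
#include <set>

constexpr int kBanks = 16;     // assumption: compute capability 1.x
constexpr int kHalfWarp = 16;  // conflicts are evaluated per half-warp

// Bank touched by thread `tid` for the k-th term of the original sum, i.e.
// ((unsigned short *)(&s_r[tid + k*256]))[0]: the first 16 bits of float
// slot tid + k*256.
int bank_for_term(int tid, int k) {
    int byte_addr = 4 * (tid + k * 256);  // s_r is a float array: 4 bytes per slot
    return (byte_addr / 4) % kBanks;
}

// True if the 16 threads starting at first_tid hit 16 distinct banks for term k.
bool half_warp_conflict_free(int first_tid, int k) {
    std::set<int> banks;
    for (int t = 0; t < kHalfWarp; ++t) {
        banks.insert(bank_for_term(first_tid + t, k));
    }
    return static_cast<int>(banks.size()) == kHalfWarp;
}
```

In this model every term of the sum maps thread `tid` to bank `tid % 16`, so each half-warp touches 16 distinct banks and the `k*256` offset between successive loads changes nothing.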

Serializations are not caused just by bank conflicts. Other causes are randomish constant memory access and atomics. (Anyone know of any others?)
