As far as I know, two successive shorts in memory are located in the same memory bank, whereas two successive ints reside in different banks. In your code there will be two-way bank conflicts between pairs of neighboring threads. A solution would be a different interleaving scheme for your shorts.
The other problem, as already noted, is the array stride of 256, which results in each memory row starting at the same bank. Going to a stride of 258 would improve things.
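To make the stride arithmetic concrete, here is a small host-side sketch (plain C++, no GPU needed). It assumes 16 banks that are each 4 bytes wide and 2-byte shorts; those figures, along with the names `row_start_bank`, `kBanks`, `kWordBytes` and `kShortBytes`, are my assumptions for illustration and are not taken from the original code.

```cpp
#include <cstddef>

// Assumed hardware model: 16 shared-memory banks, each 32 bits (4 bytes) wide.
constexpr std::size_t kBanks = 16;
constexpr std::size_t kWordBytes = 4;
constexpr std::size_t kShortBytes = 2;  // assumed size of a short

// Bank in which row `row` begins when every row holds `stride_in_shorts` shorts.
constexpr std::size_t row_start_bank(std::size_t row, std::size_t stride_in_shorts) {
    return (row * stride_in_shorts * kShortBytes / kWordBytes) % kBanks;
}
```

Under these assumptions a stride of 256 shorts is 128 words, a multiple of 16, so `row_start_bank(r, 256)` is 0 for every row. A stride of 258 shorts is 129 words, so `row_start_bank(r, 258)` works out to `r % 16` and consecutive rows start in different banks.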
Bank accesses are done in units of 32 bits. Shorts are only 16 bits, but the multiprocessor still loads 32 bits. In your example, thread #0 conflicts with thread #1, thread #2 with #3, etc., at each short loaded in the summation. I’d just leave them as 4-byte values and eat 8k of shared memory. See slide #4 of this presentation: http://courses.ece.uiuc.edu/ece498/al1/lec…fall%202007.ppt
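A rough host-side sketch of that mapping, in plain C++ so it can be checked without a GPU. It assumes 16 banks of 32 bits each; the bank count and the helper names `bank_of_byte` and `bank_of_element` are my assumptions for illustration, not something from the original code or the linked slides.

```cpp
#include <cstddef>

// Assumed hardware model: 16 shared-memory banks, each 32 bits (4 bytes) wide.
constexpr std::size_t kNumBanks = 16;
constexpr std::size_t kBankWidthBytes = 4;

// Bank holding the byte at `byte_offset` from the start of shared memory.
constexpr std::size_t bank_of_byte(std::size_t byte_offset) {
    return (byte_offset / kBankWidthBytes) % kNumBanks;
}

// Bank touched by element `index` of an array whose elements are `elem_size` bytes.
constexpr std::size_t bank_of_element(std::size_t index, std::size_t elem_size) {
    return bank_of_byte(index * elem_size);
}
```

With 2-byte elements, indices 0 and 1 both land in bank 0 and indices 2 and 3 both land in bank 1, which is the pairing described above. With 4-byte elements, indices 0 through 15 each get their own bank.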
Alternatively, if eating the 8k of shared memory isn’t an option, you could do this:
You didn’t read his code carefully. His stride is 4 bytes between threads, not 2. Notice how he’s indexing a float array and reinterpreting it as shorts. Likewise, stride between successive loads is irrelevant.
Serializations are not caused only by bank conflicts. Other causes are randomish constant memory access and atomics. (Anyone know of any others?)