Is the SDK's reduction example not fully optimized?

Going through the reduction example in the SDK, I couldn't understand why
kernel 6 has the term blockSize*2,
since each thread processes multiple elements anyway.

This would make sense only if every thread accumulated multiple elements into sdata[tid] and also into sdata[tid+blockSize].
Then, in the first reduction step done on shared memory, all the threads would be active, which should improve performance slightly.
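
To make the question concrete, here is a minimal sketch of the load phase of kernel 6, written as a plain C++ emulation on the CPU rather than as the SDK source. The function name emulate_load_phase, the per-block sdata vectors, and the bounds check on i + blockSize are assumptions for illustration (the kernel itself, as far as I recall, relies on power-of-two sizes instead of that check). It shows that each thread reads two elements blockSize apart per iteration, but adds both into the single slot sdata[tid]:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of the load loop only (a sketch, not the SDK code).
// Returns the shared-memory contents per block: result[block][tid].
std::vector<std::vector<long long>> emulate_load_phase(
    const std::vector<int>& g_idata, unsigned blockSize, unsigned gridDim)
{
    assert(blockSize > 0 && gridDim > 0);
    const std::size_t n = g_idata.size();
    std::vector<std::vector<long long>> sdata(
        gridDim, std::vector<long long>(blockSize, 0));

    // Elements consumed by the whole grid in one loop iteration.
    const std::size_t gridSize = std::size_t(blockSize) * 2 * gridDim;

    for (unsigned block = 0; block < gridDim; ++block) {
        for (unsigned tid = 0; tid < blockSize; ++tid) {
            // Each block starts at a chunk of blockSize*2 elements.
            std::size_t i = std::size_t(block) * (blockSize * 2) + tid;
            while (i < n) {
                // Two reads per iteration, both accumulated into sdata[tid].
                sdata[block][tid] += g_idata[i];
                if (i + blockSize < n)  // assumed guard, see lead-in
                    sdata[block][tid] += g_idata[i + blockSize];
                i += gridSize;
            }
        }
    }
    return sdata;
}
```

With 16 inputs, blockSize = 4 and one block, thread 0 ends up holding elements 0, 4, 8 and 12. So blockSize*2 here is just the stride of the two reads per iteration; shared memory still has only blockSize entries, and the first shared-memory step still leaves half of the threads idle.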

As a whole, the SDK is a great help.
Keep up the good work; it is appreciated.

I think this was done to guarantee that there are no bank conflicts in shared memory.