Going through the reduction example in the SDK, I couldn't understand why
kernel 6 contains the term blockSize*2,
since each thread processes multiple elements anyway.
This would make sense only if every thread accumulated multiple elements into sdata[tid] and then also into sdata[tid+blockSize].
Then all the threads would be active in the first reduction step done on shared memory, which should improve performance slightly.
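To make the question concrete, here is a minimal CPU-side sketch in C++ of the two indexing schemes as I understand them. It is not the SDK source: the function names are made up for illustration, the threads are emulated with plain loops, and the `i + blockSize < n` bounds check is my own addition so the sketch works for any input length. The first function sums both loads into sdata[tid], which is how I read kernel 6; the second is the two-slot variant I describe above, where shared memory holds 2*blockSize entries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential stand-in for the shared-memory tree reduction.
// The first step uses sdata.size() / 2 emulated threads.
static long long treeReduce(std::vector<long long>& sdata)
{
    for (std::size_t s = sdata.size() / 2; s > 0; s >>= 1)
        for (std::size_t tid = 0; tid < s; ++tid)
            sdata[tid] += sdata[tid + s];
    return sdata[0];
}

static bool isPowerOfTwo(unsigned x) { return x > 0 && (x & (x - 1)) == 0; }

// Hypothetical name. My reading of kernel 6: each thread adds in[i] and
// in[i + blockSize] into the single slot sdata[tid], so one loop pass
// consumes 2 * blockSize elements per block and the stride is
// blockSize * 2 * gridDim. Only blockSize / 2 threads work in the first
// tree step.
long long blockSumKernel6Style(const std::vector<int>& in, unsigned blockSize,
                               unsigned gridDim, unsigned blockIdx)
{
    assert(isPowerOfTwo(blockSize) && gridDim > 0 && blockIdx < gridDim);
    const std::size_t n = in.size();
    const std::size_t gridSize = std::size_t(blockSize) * 2 * gridDim;
    std::vector<long long> sdata(blockSize, 0);
    for (unsigned tid = 0; tid < blockSize; ++tid) {
        std::size_t i = std::size_t(blockIdx) * (2 * blockSize) + tid;
        while (i < n) {
            sdata[tid] += in[i];
            if (i + blockSize < n)          // bounds check added for this sketch
                sdata[tid] += in[i + blockSize];
            i += gridSize;
        }
    }
    return treeReduce(sdata);
}

// Hypothetical name. The variant from my question: shared memory has
// 2 * blockSize slots, and the second load goes to sdata[tid + blockSize],
// so all blockSize threads work in the first tree step.
long long blockSumTwoSlotStyle(const std::vector<int>& in, unsigned blockSize,
                               unsigned gridDim, unsigned blockIdx)
{
    assert(isPowerOfTwo(blockSize) && gridDim > 0 && blockIdx < gridDim);
    const std::size_t n = in.size();
    const std::size_t gridSize = std::size_t(blockSize) * 2 * gridDim;
    std::vector<long long> sdata(2 * std::size_t(blockSize), 0);
    for (unsigned tid = 0; tid < blockSize; ++tid) {
        std::size_t i = std::size_t(blockIdx) * (2 * blockSize) + tid;
        while (i < n) {
            sdata[tid] += in[i];
            if (i + blockSize < n)          // bounds check added for this sketch
                sdata[tid + blockSize] += in[i + blockSize];
            i += gridSize;
        }
    }
    return treeReduce(sdata);
}
```

With the values 1..16 as input, blockSize 4 and a single block, both functions return 136. The only difference is where the second load lands and how many threads the first shared-memory step keeps busy, at the cost of twice the shared memory in the two-slot version.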
As a whole, the SDK is a great help.
Keep up the good work; it is appreciated.