Does the number of shared memory banks affect results?

Hi Everyone;

I have a theory about the shared memory banks on the GPU. I wrote some code, and its last
statement is the following:

odata[tid] += sdata[tid];

odata is placed in global memory and the sdata arrays are in shared memory. As you can
see, I want to sum all the sdata arrays by index order. That is, the first element
of sdata in threadBlock0 is added to the first element of sdata in
threadBlock1, and so on.

I have noticed something in the results. When the number of thread blocks is <= 16, the result
is correct. In the other cases (number of thread blocks > 16), the result is sometimes wrong: the sdata arrays
of some blocks are skipped in the summation. I tested different numbers of threads per
block, and the result is the same.

My guess is that the reason is that the number of shared memory banks on the
GPU is 16. Is this logical? Any suggestions? Is there a different approach you could suggest
for this summation?


The reason is probably that you don’t use atomicAdd() to sum the values in global memory.
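For integer data, a minimal sketch of that fix might look like the following. The kernel body and names here are assumptions, not the original code; atomicAdd() on global int is available from compute capability 1.1, so it works on a C1060.

```cuda
// Hypothetical sketch: every block adds its shared array into odata.
// Without the atomic, blocks race on odata[tid] and updates get lost,
// which matches the "skipped blocks" symptom described above.
__global__ void accumulate(int *odata)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;

    sdata[tid] = (int)(blockIdx.x + tid);  // placeholder for the real per-block work
    __syncthreads();

    atomicAdd(&odata[tid], sdata[tid]);    // serialized read-modify-write, no lost updates
}
```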

Thank you for your advice, but I have two problems with the method you suggested. First,
I have a Tesla C1060 with compute capability 1.3, so I cannot use atomicAdd() because
I need the floating-point version, and my card does not support it; that
feature is only supported by devices of CC 2.*. Second is a performance issue: atomic operations
are too slow.

Appendix B.11 “Atomic Functions” of the (4.0) Programming Guide shows how to create any atomic function from atomicCAS().
If you don’t want to use atomic operations, you can alternatively use a reduction scheme.

Unfortunately atomicCAS() also works only on int*, not float*.

I already use a reduction method in the shared-memory part of the calculation. But I don’t know how I can
apply that method to this part of the code:

odata[tid] += sdata[tid];

odata is placed in global memory and the sdata arrays are in shared memory.

Could you give me any advice?


Part of the magic lies in using the __float_as_int() intrinsic function. Just check appendix B.11.
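Adapted to float, the pattern from the Guide looks like the sketch below (the wrapper name is mine, not the Guide’s):

```cuda
// float atomicAdd() built from atomicCAS(), per Appendix B.11.
// The CAS loop retries until no other thread has modified the value in between.
__device__ float atomicAddFloat(float *address, float val)
{
    int *address_as_int = (int *)address;
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        // Reinterpret the bits as float, add, and reinterpret back for the CAS.
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);
}
```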

A reduction scheme would reserve an array in global memory where each block just stored its result in its own element. The elements would then be summed by a separately launched second kernel (or just by the CPU if there aren’t too many elements).
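Applied to the per-index sum in this thread, that scheme could look like the following sketch. Kernel and buffer names are made up; g_partial is assumed to be a numBlocks × width scratch array in global memory.

```cuda
// Pass 1: each block writes its sdata into its own row of g_partial,
// so no two blocks ever touch the same element -> no atomics needed.
__global__ void store_partials(float *g_partial, int width)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;

    sdata[tid] = (float)tid;  // placeholder for the real per-block computation
    __syncthreads();

    g_partial[blockIdx.x * width + tid] = sdata[tid];
}

// Pass 2: a separately launched kernel sums the rows column by column.
__global__ void sum_partials(const float *g_partial, float *odata,
                             int numBlocks, int width)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < width) {
        float sum = 0.0f;
        for (int b = 0; b < numBlocks; ++b)
            sum += g_partial[b * width + tid];
        odata[tid] += sum;
    }
}
```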

Thanks, I tried two methods: with and without atomicAdd(). The block count is 32 and the array (int) has 330M elements.

The result is interesting. atomicAdd() gives correct results, and its running time is shorter than the method without atomicAdd().

Now I will try the float version…