I have a theory about the shared memory banks on the GPU. I wrote a kernel, and its
last statement is the following:
odata[tid] += sdata[tid];
odata is in global memory and each block's sdata is in shared memory. As you can
see, I want to sum all the sdata arrays element by element. That is, the first element
of sdata in thread block 0 is added to the first element of sdata in
thread block 1, and so on.
I noticed something in the results. When the number of thread blocks is <= 16, the result
is correct. In the other cases (number of thread blocks > 16), the result is sometimes wrong:
the sdata of some blocks is skipped in the summation. I tested different numbers of threads per
thread block, and the behavior is the same.
My guess is that this happens because the GPU has 16 shared memory
banks. Does that make sense? Any suggestions? Is there different code you could suggest
for this summation?
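
One alternative worth trying is sketched below. It assumes the missing contributions come from several blocks doing an unsynchronized read-modify-write on the same odata[tid], which is an assumption and not something the results above confirm. It also assumes odata holds float values and that the device supports atomicAdd on float in global memory (compute capability 2.0 or later). The kernel name accumulateBlocks, the host helper sumAcrossBlocks, and the placeholder value written to sdata are hypothetical and only stand in for the real code.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each block fills its own shared array sdata,
// then adds it element by element into the global array odata.
__global__ void accumulateBlocks(float *odata)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;

    // Placeholder for the real per-block computation (assumption):
    // every block contributes 1.0f to every element.
    sdata[tid] = 1.0f;

    // Atomic read-modify-write: blocks that update the same odata[tid]
    // concurrently cannot overwrite each other's contribution.
    atomicAdd(&odata[tid], sdata[tid]);
}

// Hypothetical host helper: launches numBlocks blocks of threadsPerBlock
// threads and returns the accumulated array. Error checking is omitted.
std::vector<float> sumAcrossBlocks(int numBlocks, int threadsPerBlock)
{
    std::vector<float> host(threadsPerBlock, 0.0f);
    size_t bytes = threadsPerBlock * sizeof(float);

    float *d_odata = nullptr;
    cudaMalloc(&d_odata, bytes);
    cudaMemset(d_odata, 0, bytes);   // odata must start at zero

    // Third launch parameter: dynamic shared memory size for sdata.
    accumulateBlocks<<<numBlocks, threadsPerBlock, bytes>>>(d_odata);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(host.data(), d_odata, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_odata);
    return host;
}
```

With this placeholder, calling sumAcrossBlocks(64, 128) should return 128 elements that each equal the number of blocks, regardless of whether that number is above or below 16. If the original odata[tid] += sdata[tid] gives smaller values for the same launch, that would point to a race between blocks and not to the shared memory banks.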