Does the number of shared memory banks affect results?

Hi Everyone;

I have a theory about the shared memory banks on the GPU. I wrote some code, and its last
statement is the following:

odata[tid] += sdata[tid];

odata is placed in global memory and the sdata arrays are in shared memory. As you can
see, I want to sum all the sdata arrays by index order. That is, the first element
of sdata in threadBlock0 is added to the first element of sdata in
threadBlock1, and so on.

I have noticed something in the results. When the number of thread blocks is <= 16, the result
is correct. In the other cases (number of thread blocks > 16), the result is sometimes wrong: the sdata arrays
of some blocks are skipped in the summation. I tested different numbers of threads per
block, and the result is the same.

My guess is that the reason is that the number of shared memory banks on the
GPU is 16. Is this logical? Any suggestions? Is there a different approach you could suggest
for this summation?


The reason is probably that you don’t use atomicAdd() to sum the values in global memory.
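For integer data, a minimal sketch of that fix might look like the following. The kernel body and names here are assumptions, not the original code; atomicAdd() on global int is available from compute capability 1.1, so it works on a C1060.

```cuda
// Hypothetical sketch: every block adds its shared array into odata.
// Without the atomic, blocks race on odata[tid] and updates get lost,
// which matches the "skipped blocks" symptom described above.
__global__ void accumulate(int *odata)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;

    sdata[tid] = (int)(blockIdx.x + tid);  // placeholder for the real per-block work
    __syncthreads();

    atomicAdd(&odata[tid], sdata[tid]);    // serialized read-modify-write, no lost updates
}
```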

Thank you for your advice, but I have two problems with the method you suggested. First,
I have a Tesla C1060 with compute capability 1.3, so I cannot use atomicAdd() because
I need the floating-point version, and my card does not support it; that
feature is only supported by devices of CC 2.*. Second is a performance issue: atomic operations
are too slow.

Appendix B.11 “Atomic Functions” of the (4.0) Programming Guide shows how to create any atomic function from atomicCAS().
If you don’t want to use atomic operations, you can alternatively use a reduction scheme.

Unfortunately atomicCAS() also works only on int*, not float*.

I already use a reduction method in the shared-memory part of the calculation. But I don’t know how I can
apply that method to this part of the code:

odata[tid] += sdata[tid];

odata is placed in global memory and the sdata arrays are in shared memory.

Could you give me any advice?


Part of the magic lies in using the __float_as_int() intrinsic function. Just check appendix B.11.
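Adapted to float, the pattern from the Guide looks like the sketch below (the wrapper name is mine, not the Guide’s):

```cuda
// float atomicAdd() built from atomicCAS(), per Appendix B.11.
// The CAS loop retries until no other thread has modified the value in between.
__device__ float atomicAddFloat(float *address, float val)
{
    int *address_as_int = (int *)address;
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        // Reinterpret the bits as float, add, and reinterpret back for the CAS.
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);
}
```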

A reduction scheme would reserve an array in global memory where each block just stored its result in its own element. The elements would then be summed by a separately launched second kernel (or just by the CPU if there aren’t too many elements).
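Applied to the per-index sum in this thread, that scheme could look like the following sketch. Kernel and buffer names are made up; g_partial is assumed to be a numBlocks × width scratch array in global memory.

```cuda
// Pass 1: each block writes its sdata into its own row of g_partial,
// so no two blocks ever touch the same element -> no atomics needed.
__global__ void store_partials(float *g_partial, int width)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;

    sdata[tid] = (float)tid;  // placeholder for the real per-block computation
    __syncthreads();

    g_partial[blockIdx.x * width + tid] = sdata[tid];
}

// Pass 2: a separately launched kernel sums the rows column by column.
__global__ void sum_partials(const float *g_partial, float *odata,
                             int numBlocks, int width)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < width) {
        float sum = 0.0f;
        for (int b = 0; b < numBlocks; ++b)
            sum += g_partial[b * width + tid];
        odata[tid] += sum;
    }
}
```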

Thanks, I tried two methods: with and without atomicAdd(). The block count is 32 and the array (int) has 330M elements.

The result is interesting. atomicAdd() gives correct results, and its running time is shorter than the method without atomicAdd().

Now I will try the float version…