# syncthreads() and += operator...

Hi everyone,

I am new to CUDA and I tried to write my first CUDA code using the += operator.

The kernel is the following:

```
__global__ void SumArray(float *a, float *c, const unsigned int N)
{
    unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int j = threadIdx.y + blockIdx.y * blockDim.y;
    unsigned int index = i * N + j;

    float cSub = 0;
    cSub += a[index];

    __syncthreads();

    c[0] = cSub;
    return;
}
```

The kernel returns cSub = 1 (all elements of a are 1). Each thread has its own cSub value, so this result makes sense, but I used __syncthreads() to handle this situation. I have a Quadro FX 5600 and I am using CUDA 2.0.

Your kernel is equivalent to the following one:

```
__global__ void SumArray(float *a, float *c, const unsigned int N)
{
    unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int j = threadIdx.y + blockIdx.y * blockDim.y;
    unsigned int index = i * N + j;

    c[0] = a[index];
}
```

This is not a summation of the array; every thread just writes its own element of “a” to the first element of “c”, and a race condition occurs.

Thanks for the explanation. So how can I calculate the sum of all elements of an array?

I looked at the reduction example in the SDK, but it doesn’t help me because my sum
operator has some complex conditions.

Thanks again…

What are your complex conditions?

In my algorithm, there is a main for loop that calculates the result over some parts of the array.
In each iteration, the location and length of the parts change. For example
(these are just example loops and values):

```
for (k = 0; k < 4; k++)
{
    total = 0;
    mStart = k * 12;
    mStep = (k + 1) * 36;   /* k * 36 would be zero for k == 0 and never advance */
    for (m = mStart; m < arraySize; m += mStep)
    {
        Boundary = m + 6;
        for (Index = m; Index < Boundary; Index++)
            total += array[Index];
    }
    totalArray[k] = total;
}
```

The simplest way is to use one thread block to deal with one “k”, so you need 4 thread blocks (this means only 4 SMs are used).

Each thread block (32 threads per block) can then do a reduction, the same as in the SDK reduction example:

```
__shared__ float partial_sum[32];

unsigned int k = blockIdx.x;
unsigned int mStart = k * 12;
unsigned int mStep = (k + 1) * 36;   /* k * 36 would be zero for block 0 */

partial_sum[threadIdx.x] = 0.0f;

for (unsigned int m = mStart + threadIdx.x; m < arraySize; m += mStep)
{
    partial_sum[threadIdx.x] += array[m];
}

__syncthreads();

// then do the reduction over the 6 useful elements, please see the reduction example
```

However, this method has several drawbacks:

(1) Only 4 SMs are used, so the usable bandwidth is only 4/8 = 1/2.

(2) Each thread block has only 32 threads, and each SM holds one thread block. This means one SM has only 32 active threads, which cannot hide pipeline latency (to hide pipeline latency, one SM needs 192 threads).

(3) Although you update 32 elements in shared memory, only 6 of them are values you actually want, so the effective fraction is 6/32.

Combining (1) and (2) gives roughly a factor of 1/4, so this method reaches at most (1/4) * (6/32) ≈ 1/20 of peak bandwidth.

On a Tesla C1060, that is about 5 GB/s, maybe only slightly better than CPU code.

Thank you so much for your quick help…