Hi everyone,

I have a problem with array summation in my CUDA code.

Suppose we have an array of 617 x 4 = 2468 elements (617 is a prime number), and I want to calculate the sums of the four parts [0-616], [617-1233], [1234-1850] and [1851-2467].

A thread block can have at most 512 threads, and because 617 is prime, no block size from 2 to 512 divides the part size evenly. So how can I update the following CUDA code for this purpose?

[codebox]__global__ void SumArray(float *a, float *b, float *c, const unsigned int N)
{
    unsigned int tx = threadIdx.x;
    unsigned int bx = blockIdx.x;

    unsigned int aBegin = blockDim.x * bx;
    unsigned int aEnd = N;
    unsigned int aStep = 512;

    for(unsigned int k = 0; k < 4; ++k)
    {
        float total = 0;

        for(unsigned int i = aBegin; i < aEnd; i += aStep)
        {
            __shared__ float aS[512];

            aS[tx] = a[i + tx];
            __syncthreads();

            for(unsigned int j = 0; j < 512; ++j)
                total += aS[j];
            __syncthreads();
        }

        c[k] = total;
    }

    return;
}[/codebox]
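
To make the expected result concrete, here is a minimal CPU reference sketch of what I want to end up in `c`. The function name `sumPartsReference` and its parameters are hypothetical, chosen only for illustration, and it assumes the array length is exactly `numParts * partSize`. It is plain host code, not a proposed kernel.

```cpp
#include <cstddef>
#include <vector>

// CPU reference (hypothetical helper, not part of the kernel above):
// sums each contiguous part of `a` and returns one total per part.
// Assumes a.size() == numParts * partSize.
std::vector<float> sumPartsReference(const std::vector<float>& a,
                                     std::size_t numParts,
                                     std::size_t partSize)
{
    std::vector<float> c(numParts, 0.0f);
    for (std::size_t k = 0; k < numParts; ++k) {
        const std::size_t begin = k * partSize;      // 0, 617, 1234, 1851 in my case
        const std::size_t end   = begin + partSize;  // exclusive upper bound
        for (std::size_t i = begin; i < end; ++i)
            c[k] += a[i];
    }
    return c;
}
```

With 4 parts of 617 elements each, `c[k]` should hold the sum of `a[617*k]` through `a[617*k + 616]`, and the kernel output should match this up to floating-point rounding.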

Many thanks…