The number of elements should be 2^n

What if the length is odd, or some other number that is not a power of two, such as 15 or 60?

For example:
float array[] = {1, 2, 3, 4, 5, … 100};
I’d like to get the sum of this array.
But the available SDK example only covers the case where the length of the array is 2^n; it does not support odd lengths.

Is there an efficient, recommended way to write a kernel that sums the array above?

I’d say use CUDPP.

Often you just pad with zeros: if your array has 100 elements, pad it to 128, where the extra 28 elements are zero. In some cases it’s faster to add 128 numbers than to add 100 elements with a lot of additional code that complicates the parallel algorithm ;-)
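To make the padding idea concrete, here is a rough host-side sketch. `nextPow2` and `copyPaddedToDevice` are hypothetical helper names chosen for this example; the only CUDA runtime calls used are `cudaMalloc`, `cudaMemset`, `cudaMemcpy` and `cudaFree`. It assumes the padded buffer fits in device memory.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: smallest power of two that is >= n (returns 1 for n <= 1).
size_t nextPow2(size_t n)
{
    size_t p = 1;
    while (p < n) p *= 2;
    return p;
}

// Hypothetical helper: copies n floats into a zero-padded device buffer of
// nextPow2(n) elements. Returns the device pointer (nullptr on failure) and
// stores the padded length in *paddedCount. The caller frees it with cudaFree.
float *copyPaddedToDevice(const float *h_data, size_t n, size_t *paddedCount)
{
    size_t padded = nextPow2(n);
    float *d_data = nullptr;
    if (cudaMalloc((void **)&d_data, padded * sizeof(float)) != cudaSuccess)
        return nullptr;
    // All-zero bytes represent 0.0f, so the padding does not change the sum.
    if (cudaMemset(d_data, 0, padded * sizeof(float)) != cudaSuccess ||
        cudaMemcpy(d_data, h_data, n * sizeof(float),
                   cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(d_data);
        return nullptr;
    }
    *paddedCount = padded;
    return d_data;
}
```

For the 100-element array from the question, `nextPow2(100)` is 128, so the power-of-two reduction kernel can then be run on 128 elements unchanged.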

Another option is to pass the data size as a parameter to the kernel, then add a bounds check inside the kernel so that it only reads values inside the array. For example, in the first reduce kernel from the SDK:

```
__global__ void
reduce0(int *g_idata, int *g_odata, const unsigned int size)
{
    extern __shared__ int sdata[];

    // each thread loads one element from global to shared mem;
    // threads past the end of the array load 0 instead
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = (i < size) ? g_idata[i] : 0;                         // <---this line changed
    __syncthreads();

    // do reduction in shared mem (blockDim.x must still be a power of two)
    for(unsigned int s=1; s < blockDim.x; s *= 2) {
        // modulo arithmetic is slow!
        if ((tid % (2*s)) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if ((i < size) && (tid == 0)) g_odata[blockIdx.x] = sdata[0];               // <---this line changed
}
```

You also need to change the code calling the kernel to launch enough blocks to cover the entire array, i.e. the array size divided by the block size, rounded up.

(This is just to illustrate the basic idea - I haven’t compiled or tested this, but I use a similar method in parts of my code.)
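For the launch side, something along these lines should work. It is a sketch, not SDK code: `reduceBounded`, `blocksFor` and `sumOnDevice` are hypothetical names, the kernel is a float variant of the bounds-checked pattern above, and the block size of 128 is an arbitrary power of two. The per-block partial sums are simply added on the host.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Bounds-checked block reduction for float input (hypothetical name).
// Assumes blockDim.x is a power of two; the array length need not be.
__global__ void reduceBounded(const float *g_idata, float *g_odata, unsigned int size)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < size) ? g_idata[i] : 0.0f;  // out-of-range threads add 0
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if ((tid % (2 * s)) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Blocks needed so that blocks * threadsPerBlock >= n (ceiling division).
unsigned int blocksFor(unsigned int n, unsigned int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Sums n floats on the GPU, then adds the per-block partial sums on the host.
// Returns false if any CUDA call fails.
bool sumOnDevice(const float *h_data, unsigned int n, float *result)
{
    if (n == 0) { *result = 0.0f; return true; }
    const unsigned int threads = 128;  // assumed block size, power of two
    const unsigned int blocks = blocksFor(n, threads);
    std::vector<float> partial(blocks);
    float *d_in = nullptr, *d_out = nullptr;
    bool ok = cudaMalloc((void **)&d_in, n * sizeof(float)) == cudaSuccess
           && cudaMalloc((void **)&d_out, blocks * sizeof(float)) == cudaSuccess
           && cudaMemcpy(d_in, h_data, n * sizeof(float),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // Third launch parameter: dynamic shared memory, one float per thread.
        reduceBounded<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(partial.data(), d_out, blocks * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);   // cudaFree(nullptr) is a no-op
    cudaFree(d_out);
    if (!ok) return false;
    float sum = 0.0f;
    for (unsigned int b = 0; b < blocks; ++b) sum += partial[b];
    *result = sum;
    return true;
}
```

With the 100-element array from the question and 128 threads per block, `blocksFor(100, 128)` is 1, so a single block runs and its last 28 threads load zero. For large arrays you would probably run the kernel again on the partial sums rather than finish on the host.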