Hi, I'm trying to write a CUDA function that sums all the values of an array. I have this:
__global__ void cuArraySumF_D(float *src, float *sum, int len){
    __shared__ float s;
    s = 0;
    __syncthreads();
    for(int i = threadIdx.x; i < len; i += N_THREADS)
        s += src[i];
    __syncthreads();
    if(threadIdx.x == 0)
        *sum = s;
}
I launch N_THREADS threads in a single block so that shared memory works.
But it fails to run. I guess that's because every thread writes to the same shared variable.
So, is there any way to optimize an array sum in CUDA using threads?
Thanks
It looks like you are going about this the wrong way.
See the reduction sample in the SDK (and the reduction document that comes with it).
It does the same thing you want.
Maybe you can reuse it… :smile2:
Thanks, that's exactly what I needed to see :)
You can also take a look at CUDPP
Can you paste an example of summing an array in CUDA?
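Here is a minimal sketch of a single-block sum, in the spirit of what the replies above point to. It is not the SDK reduction sample itself. The problem with the kernel in the first post is that every thread does s += src[i] on one shared variable with no atomics, which is a data race. In this sketch each thread keeps a private partial sum, and the partials are then combined pairwise in shared memory. The value 256 for N_THREADS and the host helper sumOnDevice are assumptions made up for illustration, and error checking of the CUDA calls is left out to keep it short.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Assumed block size; the halving loop below needs a power of two.
#define N_THREADS 256

// Launch with exactly one block of N_THREADS threads.
__global__ void cuArraySumF_D(const float *src, float *sum, int len) {
    __shared__ float partial[N_THREADS];  // one slot per thread, so no write conflicts

    // Step 1: each thread sums its own strided slice into a register.
    float acc = 0.0f;
    for (int i = threadIdx.x; i < len; i += N_THREADS)
        acc += src[i];
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Step 2: tree reduction, halving the number of active threads each pass.
    for (int stride = N_THREADS / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();  // outside the if, so every thread reaches it
    }

    // Step 3: thread 0 publishes the block's total.
    if (threadIdx.x == 0)
        *sum = partial[0];
}

// Hypothetical host helper: copies the data over, runs the kernel, returns the sum.
float sumOnDevice(const float *hostSrc, int len) {
    assert(len > 0);
    float *dSrc = nullptr, *dSum = nullptr;
    float result = 0.0f;
    cudaMalloc(&dSrc, len * sizeof(float));
    cudaMalloc(&dSum, sizeof(float));
    cudaMemcpy(dSrc, hostSrc, len * sizeof(float), cudaMemcpyHostToDevice);
    cuArraySumF_D<<<1, N_THREADS>>>(dSrc, dSum, len);
    cudaMemcpy(&result, dSum, sizeof(float), cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dSrc);
    cudaFree(dSum);
    return result;
}
```

A single block only uses one multiprocessor, so for large arrays the multi-block approach from the SDK sample should be faster. This version is mainly meant to show how to avoid the shared-variable race.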
I found an array reduction example, but I do not understand the i^1 in it:
for ( i = 0; n >= BLOCK_SIZE; n /= (2*BLOCK_SIZE), i++ ){
    dim3 dimBlock (BLOCK_SIZE, 1, 1);
    dim3 dimGrid (n / (2*dimBlock.x), 1, 1);
    reduce4 <<< dimGrid, dimBlock >>> (adev[i], adev[i^1]);
}
I understand what i^1 evaluates to:
i | i^1
0|1
1|0
2|3
3|2
4|5
5|4
But why is it used? Can't this be written more simply?
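As the table shows, i^1 is a bitwise XOR with 1, which flips the lowest bit of i, so adev[i] and adev[i^1] are always the two members of a pair. A common reason for this idiom is "ping-pong" buffering: one buffer is the input of a pass, the other is the output, and they swap roles on the next pass. The sketch below shows that pattern under my own assumptions: the kernel addPairs and the helper pingPongSum are hypothetical names, n is assumed to be a power of two, and the index is toggled with cur ^= 1 rather than incremented as in the quoted loop. It is not the code of the example being asked about.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical pairwise-add kernel: out[j] = in[2j] + in[2j+1].
__global__ void addPairs(const float *in, float *out, int outLen) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < outLen)
        out[j] = in[2 * j] + in[2 * j + 1];
}

// Sums n floats (n assumed to be a power of two) by bouncing between two buffers.
float pingPongSum(const float *hostSrc, int n) {
    assert(n > 0 && (n & (n - 1)) == 0);
    float *buf[2] = {nullptr, nullptr};
    cudaMalloc(&buf[0], n * sizeof(float));
    cudaMalloc(&buf[1], n * sizeof(float));
    cudaMemcpy(buf[0], hostSrc, n * sizeof(float), cudaMemcpyHostToDevice);

    int cur = 0;  // index of the buffer that currently holds the data
    for (int len = n; len > 1; len /= 2) {
        int outLen = len / 2;
        int threads = 256;
        int blocks = (outLen + threads - 1) / threads;
        // Read from buf[cur], write to the other buffer buf[cur ^ 1].
        addPairs<<<blocks, threads>>>(buf[cur], buf[cur ^ 1], outLen);
        cur ^= 1;  // the output of this pass is the input of the next one
    }

    float result = 0.0f;
    cudaMemcpy(&result, buf[cur], sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(buf[0]);
    cudaFree(buf[1]);
    return result;
}
```

With only two buffers, cur ^ 1 could just as well be written 1 - cur; the XOR form is simply a compact way of saying "the other one".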