riclas
#1
hi, i’m trying to write a function in cuda that sums all the values of an array. i have this:
__global__ void cuArraySumF_D(float *src, float *sum, int len){
    __shared__ float s;
    s = 0;
    __syncthreads();
    for(int i = threadIdx.x; i < len; i += N_THREADS)
        s += src[i];
    __syncthreads();
    if(threadIdx.x == 0)
        *sum = s;
}
i’m launching N_THREADS in only one block so that shared memory works.
but it fails to run. i guess it’s because every thread is writing to the same shared variable.
so, is there any way to do an optimized array sum in cuda using threads?
thanks
Sibi_A
#2
Looks like you are doing the wrong thing.
See the reduction sample in the SDK (and the reduction document too).
It does the same thing you want.
Maybe you can reuse them… :smile2:
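For reference, this is roughly what that sample does. A minimal single-block sketch (not the SDK code itself; the kernel name is made up and it assumes a power-of-two blockDim.x of at most 256):

__global__ void cuArraySumF_sketch(const float *src, float *sum, int len)
{
    __shared__ float s[256];          // one slot per thread (blockDim.x <= 256 assumed)
    float acc = 0.0f;

    // each thread first accumulates its own private partial sum
    for (int i = threadIdx.x; i < len; i += blockDim.x)
        acc += src[i];
    s[threadIdx.x] = acc;
    __syncthreads();

    // tree reduction in shared memory: halve the active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }

    // thread 0 writes the block total
    if (threadIdx.x == 0)
        *sum = s[0];
}

Launched as cuArraySumF_sketch<<<1, 256>>>(d_src, d_sum, len); the SDK sample extends the same idea to many blocks.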
riclas
#3
thanks, that’s really what i needed to see :)
You can also take a look at CUDPP
can you paste an example of summing an array in cuda?
I found an example of array reduction, but I do not understand the i^1 part:
for (i = 0; n >= BLOCK_SIZE; n /= (2*BLOCK_SIZE), i++) {
    dim3 dimBlock(BLOCK_SIZE, 1, 1);
    dim3 dimGrid(n / (2*dimBlock.x), 1, 1);
    reduce4<<<dimGrid, dimBlock>>>(adev[i], adev[i^1]);
}
I understand what i^1 evaluates to:
i | i^1
0 | 1
1 | 0
2 | 3
3 | 2
4 | 5
5 | 4
but why is it done that way? couldn’t it be simpler?
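XOR with 1 just flips the lowest bit of an index, so it toggles between an even index and the odd one right above it (0↔1, 2↔3, …). In that loop it is used to ping-pong between buffers: one pass reads adev[i] and writes its partial sums into adev[i^1], and the next pass reads the buffer that was just written. If the XOR looks cryptic, the same idea can be written with an explicit pointer swap; a sketch (reduce4, BLOCK_SIZE, n and adev are taken from your snippet, the rest is made up):

float *in  = adev[0];   // current input buffer
float *out = adev[1];   // receives the partial sums of this pass

for (int m = n; m >= BLOCK_SIZE; m /= (2 * BLOCK_SIZE)) {
    dim3 dimBlock(BLOCK_SIZE, 1, 1);
    dim3 dimGrid(m / (2 * dimBlock.x), 1, 1);
    reduce4<<<dimGrid, dimBlock>>>(in, out);

    // the output of this pass becomes the input of the next one
    float *tmp = in;
    in  = out;
    out = tmp;
}

Some people find the explicit swap more readable; the i^1 version is just a shorter way of expressing the same buffer alternation.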