Shared memory example, v.2, regarding the example problem: scalarProd

In the scalarProd example, more specifically in the scalarProd_kernel.cu file, the original code uses a reduction method, as shown in the following:

#define ACCUM_N 1024

for(int vec = blockIdx.x; vec < VECTOR_N; vec += gridDim.x)
{

    for(int stride = ACCUM_N / 2; stride > 0; stride >>= 1)
    {
        __syncthreads();
        for(int iAccum = threadIdx.x; iAccum < stride; iAccum += blockDim.x)
            accumResult[iAccum] += accumResult[stride + iAccum];
    }
    if(threadIdx.x == 0) d_C[vec] = accumResult[0];
}

My understanding of this part is that it sums accumResult[0..1023], leaves the total in accumResult[0], and then thread 0 writes it to d_C[vec].

Now I am playing around with this code, and I replaced the summing part with the following:

float temp = 0;
for(int it = 0; it < 1024; ++it)
{
    temp += accumResult[it];
    __syncthreads();
}
d_C[vec] = temp;

I did this because I am in a situation where I cannot make the shared memory size a power of 2, so I cannot use the reduction method as written.
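For what it is worth, the power-of-2 restriction might be avoidable. Below is a minimal sketch of a tree reduction over an arbitrary length, where the stride is rounded up at each step so that an odd middle element is carried along instead of dropped. The names reduceAnyLength, sumKernel and N_ELEMS are hypothetical and not part of the SDK sample, and the size 1000 and the single block of 256 threads are just assumptions for the test.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N_ELEMS 1000   // assumed size, deliberately not a power of two

// Hypothetical helper (not in the SDK sample): sums data[0..n-1] in shared
// memory into data[0]. Must be called by every thread of the block.
__device__ void reduceAnyLength(float *data, int n)
{
    __syncthreads();                  // all earlier writes to data[] are done
    while (n > 1)
    {
        int half = (n + 1) / 2;       // round up so an odd middle element is kept
        // Writes go to [0, n - half), reads come from [half, n): no overlap.
        for (int i = threadIdx.x; i < n - half; i += blockDim.x)
            data[i] += data[i + half];
        __syncthreads();              // finish this level before the next one
        n = half;
    }
}

__global__ void sumKernel(const float *d_in, float *d_out)
{
    __shared__ float buf[N_ELEMS];
    for (int i = threadIdx.x; i < N_ELEMS; i += blockDim.x)
        buf[i] = d_in[i];
    reduceAnyLength(buf, N_ELEMS);
    if (threadIdx.x == 0) *d_out = buf[0];
}

int main()
{
    float h_in[N_ELEMS];
    for (int i = 0; i < N_ELEMS; ++i) h_in[i] = 1.0f;   // host reference sum is N_ELEMS

    float *d_in, *d_out, h_out = 0.0f;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    sumKernel<<<1, 256>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    printf("device sum = %f, expected = %f\n", h_out, (float)N_ELEMS);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

In the scalarProd kernel this would presumably replace the stride loop, with the call being reduceAnyLength(accumResult, ACCUM_N) followed by the same if(threadIdx.x == 0) write to d_C[vec].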

It seems my logic for using shared memory is not right in the edited version above.

I compared the results from the two routines and they differ, so I assume I am doing something wrong.
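One guess on my side, offered as a rough sketch and not a confirmed fix: in the original, the first __syncthreads() runs before any thread reads accumResult, while in my version the first read happens before any barrier, so a thread may read entries that other threads have not finished writing. This assumes accumResult is filled by all threads of the block just before this point. The helper name serialSumByThread0 below is hypothetical. Also, even a correct serial sum adds the floats in a different order than the tree reduction, so small rounding differences would be expected.

```cuda
// Hypothetical helper (not in the SDK sample): thread 0 adds up the shared
// array data[0..n-1] serially. Must be called by every thread of the block.
__device__ float serialSumByThread0(const float *data, int n)
{
    __syncthreads();              // wait until every thread has written data[]

    float total = 0.0f;
    if (threadIdx.x == 0)
    {
        for (int i = 0; i < n; ++i)
            total += data[i];
    }

    __syncthreads();              // nobody overwrites data[] while thread 0 reads
    return total;                 // meaningful only in thread 0
}
```

The call site would then look like float temp = serialSumByThread0(accumResult, 1024); followed by if(threadIdx.x == 0) d_C[vec] = temp; so that only one thread writes the result.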

Any comments on the summation part?

Many thanks in advance for your valuable comments and advice…