In the scalarProd example, more specifically in the scalarProd_kernel.cu file, the original kernel uses a reduction, as shown in the following:
#define ACCUM_N 1024
…
for(int vec = blockIdx.x; vec < VECTOR_N; vec += gridDim.x)
{
    for(int stride = ACCUM_N / 2; stride > 0; stride >>= 1)
    {
        __syncthreads();
        for(int iAccum = threadIdx.x; iAccum < stride; iAccum += blockDim.x)
            accumResult[iAccum] += accumResult[stride + iAccum];
    }

    if(threadIdx.x == 0) d_C[vec] = accumResult[0];
}
My understanding of this part is that it sums up accumResult[0..1023], leaves the total in accumResult[0], and then thread 0 stores it to d_C[vec].
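To double-check that understanding, here is a stripped-down, self-contained sketch I wrote of the same stride-halving pattern on a tiny power-of-two array; the size (8), the names, and the launch configuration are my own and not from the SDK:

#include <cstdio>
#include <cuda_runtime.h>

#define N 8  // tiny power-of-two size, for illustration only

__global__ void reduceSketch(const float *d_in, float *d_out)
{
    __shared__ float accum[N];

    // Each thread loads one element into shared memory.
    accum[threadIdx.x] = d_in[threadIdx.x];
    __syncthreads();

    // Same pattern as the SDK kernel: every pass folds the upper half
    // of the live region onto the lower half, then halves the stride.
    for(int stride = N / 2; stride > 0; stride >>= 1)
    {
        __syncthreads();
        for(int i = threadIdx.x; i < stride; i += blockDim.x)
            accum[i] += accum[stride + i];
    }

    // After the last pass the whole sum sits in accum[0].
    if(threadIdx.x == 0) *d_out = accum[0];
}

int main()
{
    float h_in[N] = {1, 2, 3, 4, 5, 6, 7, 8}, h_out = 0;
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in, N * sizeof(float));
    cudaMalloc((void**)&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
    reduceSketch<<<1, N>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f (expected 36)\n", h_out);  // 1+2+...+8 = 36
    return 0;
}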
Now, I am playing around with this code, and I have replaced the summing part with the following:
float temp = 0;
for(int it = 0; it < 1024; ++it)
{
    temp += accumResult[it];
    __syncthreads();
}
d_C[vec] = temp;
I did this because I am in a situation where I cannot make the shared-memory size a power of 2, so I cannot use the reduction method as written.
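As an aside, I also wondered whether the reduction itself could be guarded so that it no longer assumes a power of two. The sketch below is what I had in mind; it is entirely my own guess (the rounding-up split point in particular) and not anything from the SDK, so it may well be wrong:

// Fold the live region [0, n) in halves, rounding the split point up
// so that odd sizes still shrink; my own guess, not SDK code.
for(int n = ACCUM_N; n > 1; )
{
    int half = (n + 1) / 2;
    __syncthreads();
    for(int i = threadIdx.x; i < n - half; i += blockDim.x)
        accumResult[i] += accumResult[half + i];
    n = half;
}
// The total should end up in accumResult[0], as before.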
It seems like my logic for using shared memory is not right in my sequential-sum version. I compared the results from the two routines, and they are different, so I assume I am not doing it right.
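In case it clarifies what I am asking, this is what I think a corrected version of my sequential sum would have to look like, with a barrier before the shared array is read and only thread 0 writing the result; the extra __syncthreads() calls are my own addition and I have not verified this against the SDK output:

// Make sure every thread's partial results in accumResult[] are visible
// before they are read back.
__syncthreads();

if(threadIdx.x == 0)
{
    // One thread accumulates the whole shared array; no synchronization
    // is needed inside this loop since only thread 0 is reading.
    float temp = 0;
    for(int it = 0; it < ACCUM_N; ++it)
        temp += accumResult[it];
    d_C[vec] = temp;
}

// Keep accumResult[] intact until the read above has finished, so the
// next vec iteration does not overwrite it too early.
__syncthreads();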
Any comments on the summation part?
Many thanks in advance for your valuable comments and advice…