Simple cdot implementation and how to improve performance

I have some questions about the scalarProd project.

  1. Why does, for instance, the thread with threadIdx.x = 0 and blockIdx.x = 0 handle the accumulators iAccum = [0 256 512 768], and why must pos take the values [0 1024 2048 3072] for this configuration when iAccum is 0? Are there any coalescing or bank conflict issues that I don't see?
       

      for(int iAccum = threadIdx.x; iAccum < ACCUM_N; iAccum += blockDim.x){
          float sum = 0;
          for(int pos = vectorBase + iAccum; pos < vectorEnd; pos += ACCUM_N)
              sum += d_A[pos] * d_B[pos];
          accumResult[iAccum] = sum;
      }
  2. How can we be sure that the reduction is finished when the thread with threadIdx.x = 0 writes the result, given that the issue order of warps is undefined?
       for(int stride = ACCUM_N / 2; stride > 0; stride >>= 1){
           __syncthreads();
           for(int iAccum = threadIdx.x; iAccum < stride; iAccum += blockDim.x)
               accumResult[iAccum] += accumResult[stride + iAccum];
       }
       if(threadIdx.x == 0) d_C[vec] = accumResult[0];
  3. I have simplified this into the following version, in which the input array of the dot product has no more than 128 elements.
       int index = threadIdx.x;
       __shared__ float accum[VECTOR_DIM];
       accum[index] = g_idata[index] * g_idata[index];
       for(int stride = VECTOR_DIM / 2; stride > 0; stride >>= 1){
           __syncthreads();
           accum[index] += accum[stride + index];
       }
       if(index == 0)
           *g_odata = accum[0];

How can I improve my code? I don't know whether I access memory in the right way (coalescing, bank conflicts).
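One possible starting point, as a sketch only and not a tuned implementation: the simplified version with a guard added to the reduction step, so that only the lower half of the threads adds in each step. Without the guard, stride + index can run past the end of accum, and a thread may read an element that another thread is updating in the same step. The sketch assumes a single block of exactly VECTOR_DIM threads and that VECTOR_DIM is a power of two (128 here). The names dotSelfKernel and dotSelfOnDevice are hypothetical and do not come from the scalarProd project.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumption: power-of-two vector length, at most 128 as stated in the question.
#define VECTOR_DIM 128

// One block of VECTOR_DIM threads computes sum(g_idata[i] * g_idata[i]).
__global__ void dotSelfKernel(const float *g_idata, float *g_odata)
{
    __shared__ float accum[VECTOR_DIM];
    const int index = threadIdx.x;

    // Consecutive threads read consecutive global addresses (coalesced)
    // and write consecutive shared-memory words (no bank conflicts).
    accum[index] = g_idata[index] * g_idata[index];

    for (int stride = VECTOR_DIM / 2; stride > 0; stride >>= 1) {
        // Barrier: all writes of the previous step are visible before reading.
        __syncthreads();
        // Guard: only threads in the lower half add, so stride + index stays
        // in bounds and nobody reads an element being written in this step.
        if (index < stride)
            accum[index] += accum[stride + index];
    }

    // Thread 0 performed the final addition into accum[0] itself.
    if (index == 0)
        *g_odata = accum[0];
}

// Hypothetical host helper: copies the input, launches one block and
// returns the result. Returns -1 on any error (a valid result is never negative).
float dotSelfOnDevice(const std::vector<float> &h_in)
{
    if (h_in.size() != VECTOR_DIM) return -1.0f;

    float *d_in = nullptr, *d_out = nullptr;
    float h_out = -1.0f;
    if (cudaMalloc((void **)&d_in, VECTOR_DIM * sizeof(float)) != cudaSuccess)
        return -1.0f;
    if (cudaMalloc((void **)&d_out, sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1.0f;
    }

    cudaError_t err = cudaMemcpy(d_in, h_in.data(), VECTOR_DIM * sizeof(float),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dotSelfKernel<<<1, VECTOR_DIM>>>(d_in, d_out);
        err = cudaGetLastError();  // reports launch errors
        if (err == cudaSuccess)    // this copy also waits for the kernel
            err = cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out : -1.0f;
}
```

Calling dotSelfOnDevice with a vector of 128 ones should give 128. With the guard in place, both the global load and the shared-memory accesses use consecutive addresses for consecutive threads, so I would not expect coalescing or bank conflict problems in this form, but I have not profiled it.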

Thx, Cem