I have some questions about the scalarprod project.
- Why does, for instance, the thread with threadIdx.x = 0 and blockIdx.x = 0 handle the accumulator slots iAccum = [0 256 512 768], and why does pos then step through [0 1024 2048 3072] when iAccum is 0? Are there any coalescing or bank conflict issues I don't see?
for(int iAccum = threadIdx.x; iAccum < ACCUM_N; iAccum += blockDim.x){
    float sum = 0;
    for(int pos = vectorBase + iAccum; pos < vectorEnd; pos += ACCUM_N)
        sum += d_A[pos] * d_B[pos];
    accumResult[iAccum] = sum;
}
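To make the index pattern I am asking about concrete, here is a minimal sketch that runs the same two loops but only records indices instead of multiplying. The values ACCUM_N = 1024 and 256 threads per block are assumptions taken from the numbers above, the vector length of 4096 is chosen only so that pos ends at 3072 for iAccum = 0, and traceIndices and runTrace are hypothetical names of mine, not part of the project.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define ACCUM_N  1024  // assumed: matches the pos step of 1024 in the question
#define THREADS  256   // assumed blockDim.x: matches the iAccum step of 256
#define ELEMENTS 4096  // assumed vector length, so pos ends at 3072 for iAccum = 0

// Same loop structure as the snippet above, but instead of multiplying it
// records which thread owns each iAccum slot and the last pos it reads.
__global__ void traceIndices(int *owner, int *lastPos)
{
    const int vectorBase = 0;  // first vector, as for blockIdx.x = 0
    const int vectorEnd  = vectorBase + ELEMENTS;
    for (int iAccum = threadIdx.x; iAccum < ACCUM_N; iAccum += blockDim.x) {
        int last = -1;
        for (int pos = vectorBase + iAccum; pos < vectorEnd; pos += ACCUM_N)
            last = pos;
        owner[iAccum]   = threadIdx.x;
        lastPos[iAccum] = last;
    }
}

// Hypothetical host helper: runs the trace with one block and copies both tables back.
bool runTrace(std::vector<int> &owner, std::vector<int> &lastPos)
{
    owner.assign(ACCUM_N, -1);
    lastPos.assign(ACCUM_N, -1);
    int *d_owner = nullptr, *d_last = nullptr;
    const size_t bytes = ACCUM_N * sizeof(int);
    if (cudaMalloc(&d_owner, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_last, bytes) != cudaSuccess) { cudaFree(d_owner); return false; }
    traceIndices<<<1, THREADS>>>(d_owner, d_last);
    bool ok = cudaMemcpy(owner.data(), d_owner, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
           && cudaMemcpy(lastPos.data(), d_last, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_owner);
    cudaFree(d_last);
    return ok;
}

int main()
{
    std::vector<int> owner, lastPos;
    if (!runTrace(owner, lastPos)) { printf("CUDA error\n"); return 1; }
    // Print the slots handled by thread 0 and the pos range each one walks.
    for (int iAccum = 0; iAccum < ACCUM_N; ++iAccum)
        if (owner[iAccum] == 0)
            printf("iAccum %4d: pos %4d .. %4d, step %d\n",
                   iAccum, iAccum, lastPos[iAccum], ACCUM_N);
    return 0;
}
```

Changing the comparison in main from owner[iAccum] == 0 to another thread index shows the slots of that thread, which makes it easy to compare what neighbouring threads read in the same loop iteration.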
- How can we assume that the reduction is finished when the thread with threadIdx.x = 0 writes the result, given that the issue order of warps is undefined?
for(int stride = ACCUM_N / 2; stride > 0; stride >>= 1){
    __syncthreads();
    for(int iAccum = threadIdx.x; iAccum < stride; iAccum += blockDim.x)
        accumResult[iAccum] += accumResult[stride + iAccum];
}
if(threadIdx.x == 0) d_C[vec] = accumResult[0];
- I have simplified the kernel into the following version, where the input array of the dot product has no more than 128 elements.
int index = threadIdx.x;
__shared__ float accum[VECTOR_DIM];
accum[index] = g_idata[index] * g_idata[index];
for(int stride = VECTOR_DIM / 2; stride > 0; stride >>= 1){
    __syncthreads();
    // only the lower half adds; without this guard stride + index runs past accum[]
    if(index < stride)
        accum[index] += accum[stride + index];
}
if(index == 0)
    *g_odata = accum[0];
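For reference, here is a minimal self-contained sketch of this simplified kernel with a small host harness, so that it can be compiled and run on its own. It assumes a single block of exactly VECTOR_DIM threads with VECTOR_DIM = 128 (a power of two); sumOfSquares and sumOfSquaresOnGpu are hypothetical names I chose for the sketch. The index < stride guard in the loop is there so that no thread reads beyond accum.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define VECTOR_DIM 128  // assumed: power of two, equal to the threads in the single block

// Multiplies each input element with itself and reduces the products in shared memory.
__global__ void sumOfSquares(const float *g_idata, float *g_odata)
{
    __shared__ float accum[VECTOR_DIM];
    const int index = threadIdx.x;
    accum[index] = g_idata[index] * g_idata[index];
    for (int stride = VECTOR_DIM / 2; stride > 0; stride >>= 1) {
        __syncthreads();  // all writes of the previous step are visible from here on
        if (index < stride)
            accum[index] += accum[stride + index];
    }
    // Thread 0 performed the last addition itself (stride == 1), so it can write the result.
    if (index == 0)
        *g_odata = accum[0];
}

// Hypothetical host helper: copies the input over, launches one block, copies the result back.
bool sumOfSquaresOnGpu(const std::vector<float> &in, float *result)
{
    if (in.size() != VECTOR_DIM) return false;
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, VECTOR_DIM * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(float)) != cudaSuccess) { cudaFree(d_in); return false; }
    bool ok = cudaMemcpy(d_in, in.data(), VECTOR_DIM * sizeof(float),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        sumOfSquares<<<1, VECTOR_DIM>>>(d_in, d_out);
        ok = cudaMemcpy(result, d_out, sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main()
{
    // Small integer inputs keep every partial sum exactly representable in float.
    std::vector<float> in(VECTOR_DIM);
    float cpu = 0.0f;
    for (int i = 0; i < VECTOR_DIM; ++i) {
        in[i] = static_cast<float>(i % 4);
        cpu += in[i] * in[i];
    }
    float gpu = 0.0f;
    if (!sumOfSquaresOnGpu(in, &gpu)) { printf("CUDA error\n"); return 1; }
    printf("GPU: %f  CPU: %f\n", gpu, cpu);
    return 0;
}
```

The harness compares against a plain CPU loop, which is the check I use to see whether a change to the kernel still gives the same result.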
How can I improve my code? I don't know whether I access memory in the right way (coalescing, bank conflicts).
Thanks, Cem