I need to parallelize a sum like this over a matrix (the thread index can be used to address the elements of the matrix):
[codebox]for (unsigned int l = 0; l < blockSize * blockSize; l++)
{
    accsum    += sum[l];
    accsumsqr += sumsqr[l];
    accsumqrt += sumqrt[l];
}[/codebox]
How can I do this while avoiding bank conflicts?
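The “Reduction” sample in the NVIDIA CUDA SDK is a good reference for solving this problem. It builds a parallel tree reduction in shared memory step by step, including a version whose addressing pattern avoids bank conflicts.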
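To make that concrete, here is a minimal sketch of a shared-memory tree reduction with sequential addressing, applied to the three arrays from the question. It is not the SDK code itself. It assumes the data is float, that blockSize is 16 (so there are 256 elements, a power of two), and that a single block with one thread per element is launched. The names BLOCK_SIZE, N, reduce3 and reduceThree are made up for this example.

[codebox]#include <cuda_runtime.h>

// Assumed for illustration: 16 x 16 = 256 elements, one thread per element.
// The reduction tree below requires N to be a power of two.
constexpr unsigned int BLOCK_SIZE = 16;
constexpr unsigned int N = BLOCK_SIZE * BLOCK_SIZE;

// Launch with one block of N threads. out[0..2] receive the three totals.
__global__ void reduce3(const float *sum, const float *sumsqr,
                        const float *sumqrt, float *out)
{
    __shared__ float s0[N], s1[N], s2[N];
    const unsigned int tid = threadIdx.x;

    // Each thread copies its own element of each array into shared memory.
    s0[tid] = sum[tid];
    s1[tid] = sumsqr[tid];
    s2[tid] = sumqrt[tid];
    __syncthreads();

    // Sequential addressing: the active threads 0..s-1 read and write
    // consecutive 4-byte words, which fall into different banks.
    for (unsigned int s = N / 2; s > 0; s >>= 1) {
        if (tid < s) {
            s0[tid] += s0[tid + s];
            s1[tid] += s1[tid + s];
            s2[tid] += s2[tid + s];
        }
        __syncthreads();  // outside the if, so every thread reaches it
    }

    if (tid == 0) {
        out[0] = s0[0];
        out[1] = s1[0];
        out[2] = s2[0];
    }
}

// Hypothetical host helper: copies three N-element arrays to the device,
// runs the kernel and returns the totals in result[0..2].
cudaError_t reduceThree(const float *hSum, const float *hSumsqr,
                        const float *hSumqrt, float result[3])
{
    float *d = nullptr;
    const size_t bytes = N * sizeof(float);
    cudaError_t err = cudaMalloc((void **)&d, 3 * bytes + 3 * sizeof(float));
    if (err != cudaSuccess) return err;

    float *dSum = d, *dSumsqr = d + N, *dSumqrt = d + 2 * N, *dOut = d + 3 * N;
    cudaMemcpy(dSum, hSum, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dSumsqr, hSumsqr, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dSumqrt, hSumqrt, bytes, cudaMemcpyHostToDevice);

    reduce3<<<1, N>>>(dSum, dSumsqr, dSumqrt, dOut);

    // The blocking copy also waits for the kernel to finish.
    err = cudaMemcpy(result, dOut, 3 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err;
}[/codebox]

If your three arrays already live in shared memory, you can drop the load step and run the loop on them in place. Note that the loop overwrites their contents with partial sums.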