How to parallelize a sum over a matrix

I need to parallelize a sum like this over a matrix (the thread index can be used to address the elements of the matrix):

[codebox]for (unsigned int l = 0; l < blockSize * blockSize; l++)
{
    accsum    += sum[l];
    accsumsqr += sumsqr[l];
    accsumqrt += sumqrt[l];
}[/codebox]

How can I do this while avoiding bank conflicts?

The “Reduction” sample in the NVIDIA SDK is a good reference tutorial for solving this problem.
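To give an idea of the approach, here is a minimal sketch of a shared-memory tree reduction with sequential addressing, adapted to the three accumulators in the question. It is not the SDK code itself. The kernel name `reduce3`, the `float` element type, the single-block launch and the power-of-two thread count are all assumptions made for illustration. In each step, thread `tid` adds element `tid + s` to element `tid`, so neighbouring threads touch neighbouring 32-bit words and therefore different shared-memory banks.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the SDK): one block reduces n elements of
// three arrays at once. Assumes blockDim.x is a power of two.
__global__ void reduce3(const float *sum, const float *sumsqr,
                        const float *sumqrt, unsigned int n, float *out)
{
    extern __shared__ float sh[];        // 3 * blockDim.x floats
    float *s0 = sh;
    float *s1 = sh + blockDim.x;
    float *s2 = sh + 2 * blockDim.x;

    unsigned int tid = threadIdx.x;
    float a = 0.0f, b = 0.0f, c = 0.0f;

    // Each thread first sums a strided slice; neighbouring threads read
    // neighbouring global addresses.
    for (unsigned int i = tid; i < n; i += blockDim.x) {
        a += sum[i];
        b += sumsqr[i];
        c += sumqrt[i];
    }
    s0[tid] = a;
    s1[tid] = b;
    s2[tid] = c;
    __syncthreads();

    // Sequential addressing: the active threads are 0..s-1 and each one
    // reads tid and tid + s, i.e. stride-1 access across threads.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            s0[tid] += s0[tid + s];
            s1[tid] += s1[tid + s];
            s2[tid] += s2[tid + s];
        }
        __syncthreads();                 // outside the if: all threads reach it
    }

    if (tid == 0) {
        out[0] = s0[0];
        out[1] = s1[0];
        out[2] = s2[0];
    }
}

int main()
{
    const unsigned int blockSize = 16;   // assumed matrix side
    const unsigned int n = blockSize * blockSize;
    const unsigned int threads = 128;    // assumed; must be a power of two

    // Test data: three arrays filled with 1, 2 and 3.
    float h[3 * n];
    for (unsigned int i = 0; i < n; i++) {
        h[i] = 1.0f;
        h[n + i] = 2.0f;
        h[2 * n + i] = 3.0f;
    }

    float *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, 3 * n * sizeof(float));
    cudaMalloc(&d_out, 3 * sizeof(float));
    cudaMemcpy(d_in, h, 3 * n * sizeof(float), cudaMemcpyHostToDevice);

    reduce3<<<1, threads, 3 * threads * sizeof(float)>>>(
        d_in, d_in + n, d_in + 2 * n, n, d_out);

    float r[3];
    cudaMemcpy(r, d_out, 3 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("accsum=%f accsumsqr=%f accsumqrt=%f\n", r[0], r[1], r[2]);

    cudaFree(d_in);
    cudaFree(d_out);

    // Small integer sums are exact in float, so compare directly.
    bool ok = r[0] == 1.0f * n && r[1] == 2.0f * n && r[2] == 3.0f * n;
    return ok ? 0 : 1;
}
```

For a matrix that does not fit in one block, the same kernel could write one partial result per block and a second launch could reduce those partials. The SDK sample goes further with loop unrolling and other optimizations.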

Also check these sources:
http://www.cse.chalmers.se/~billeter/papers.html - compaction
http://www.cse.chalmers.se/~billeter/pub/pp/index.html - compaction, prefix sum, sorting

Those provide some of the fastest algorithms currently available.