How to parallelize a sum over a matrix

I need to parallelize a sum like this over a matrix (the thread indices can be used to address the elements of the matrix):

[codebox]for (unsigned int l = 0; l < (blockSize * blockSize); l++)
{
    // accumulate all three running totals on every iteration
    accsum    += *(sum + l);
    accsumsqr += *(sumsqr + l);
    accsumqrt += *(sumqrt + l);
}[/codebox]


How can I do this while avoiding bank conflicts?

The “Reduction” sample in the NVIDIA SDK is a good reference tutorial for solving your problem.
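As a rough idea of what that kind of reduction looks like for the three sums in the question, here is a minimal sketch. It is not the SDK code itself. It assumes a single thread block, `float` data, and a power-of-two thread count; the names `reduce3`, `THREADS`, `s0`/`s1`/`s2` and the host driver are made up for illustration. The loop uses sequential addressing (thread `tid` adds element `tid + s`), so consecutive threads touch consecutive shared-memory words, which should avoid bank conflicts.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 256  // assumed threads per block; must be a power of two

// Hypothetical kernel: sums n elements of three arrays inside ONE thread block.
__global__ void reduce3(const float* sum, const float* sumsqr,
                        const float* sumqrt, unsigned int n, float* out)
{
    __shared__ float s0[THREADS];
    __shared__ float s1[THREADS];
    __shared__ float s2[THREADS];
    unsigned int tid = threadIdx.x;

    // Each thread first accumulates a strided slice of the input in registers.
    float a = 0.0f, b = 0.0f, c = 0.0f;
    for (unsigned int i = tid; i < n; i += THREADS) {
        a += sum[i];
        b += sumsqr[i];
        c += sumqrt[i];
    }
    s0[tid] = a; s1[tid] = b; s2[tid] = c;
    __syncthreads();

    // Tree reduction with sequential addressing: the active threads are
    // 0..s-1 and each reads its own slot plus the slot s positions away.
    for (unsigned int s = THREADS / 2; s > 0; s >>= 1) {
        if (tid < s) {
            s0[tid] += s0[tid + s];
            s1[tid] += s1[tid + s];
            s2[tid] += s2[tid + s];
        }
        __syncthreads();  // outside the if, so every thread reaches it
    }

    if (tid == 0) { out[0] = s0[0]; out[1] = s1[0]; out[2] = s2[0]; }
}

int main()
{
    const unsigned int blockSize = 32;            // assumed matrix side
    const unsigned int n = blockSize * blockSize; // 1024 elements
    static float h[3][n];
    for (unsigned int i = 0; i < n; ++i) {
        h[0][i] = 1.0f; h[1][i] = 2.0f; h[2][i] = 0.5f;  // made-up test data
    }

    float *d_in, *d_out, result[3];
    cudaMalloc(&d_in, 3 * n * sizeof(float));
    cudaMalloc(&d_out, 3 * sizeof(float));
    cudaMemcpy(d_in, h, 3 * n * sizeof(float), cudaMemcpyHostToDevice);

    // One block of THREADS threads; the three inputs sit back to back in d_in.
    reduce3<<<1, THREADS>>>(d_in, d_in + n, d_in + 2 * n, n, d_out);

    cudaMemcpy(result, d_out, sizeof(result), cudaMemcpyDeviceToHost);
    printf("%.1f %.1f %.1f\n", result[0], result[1], result[2]);
    // prints "1024.0 2048.0 512.0"

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

If the matrix is larger than one block can reasonably handle, the usual approach is to have each block write its partial sums to global memory and then reduce those partials in a second pass.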

Also check these sources: - compaction - compaction, prefix sum, sorting

Those describe some of the fastest algorithms currently available.