Cumulative column sums in parallel

Hi all,
I am new to CUDA, so maybe the question is very simple, but I can't find a good solution.

I need to compute "integral" (cumulative) sums over the columns of a 2D matrix.

    for (int j = 0; j < H; j++) {
        r[j][0] = g[j][0];
    }

so the 0th column of r is equal to the 0th column of g.

    for (int j = 0; j < H; j++)
        for (int i = 1; i < W; i++)
            r[j][i] = r[j][i-1] + g[j][i];

so the 1st column of r is the sum of the 0th and 1st columns of g, the 2nd column of r is the sum of columns 0, 1 and 2 of g, and so on.

When parallelizing "by row" I get a lot of uncoalesced reads and writes, because the memory locations are not accessed in order (for coalescing, the i-th thread must access the i-th word). Trying to parallelize "by column" runs into the problem that the i-th column can be calculated only after the (i-1)-th column has been calculated.

The matrices are large, approximately 5000 x 5000. r and g are currently float.
Can anybody help?

With best regards
Pavel Dvorkin

Would transposing your matrix allow coalesced reads?
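A rough way to picture this suggestion, written as plain C on the CPU rather than as a tested CUDA kernel: if the data is stored transposed, so that gT[i * H + j] holds g[j][i], then at each step i the H running sums touch H consecutive floats. With j playing the role of the thread index, that is the access pattern coalescing wants. The function names and the flat row-major layout are assumptions made for illustration; they are not from the original post.

```c
#include <stddef.h>

/* Hypothetical helper: out (cols x rows) = transpose of in (rows x cols),
   both stored as flat row-major arrays. */
void transpose(const float *in, float *out, size_t rows, size_t cols)
{
    for (size_t y = 0; y < rows; y++)
        for (size_t x = 0; x < cols; x++)
            out[x * rows + y] = in[y * cols + x];
}

/* Running sums on the transposed layout: gT and rT are W x H,
   with gT[i * H + j] holding g[j][i].
   The inner loop over j stands in for the CUDA threads (one per original row):
   neighbouring j values read and write neighbouring addresses. */
void scan_transposed(const float *gT, float *rT, size_t H, size_t W)
{
    if (W == 0)
        return;
    for (size_t j = 0; j < H; j++)
        rT[j] = gT[j];                     /* column 0 of r = column 0 of g */
    for (size_t i = 1; i < W; i++)         /* sequential dependency on i */
        for (size_t j = 0; j < H; j++)     /* independent across j */
            rT[i * H + j] = rT[(i - 1) * H + j] + gT[i * H + j];
}
```

Calling transpose on g, then scan_transposed, then transpose on the result gives back r in the original layout. Whether the two extra transposes pay off for a 5000 x 5000 matrix would have to be measured.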

Otherwise, if coalesced reads are not possible, you could store the matrix in a cudaArray, bind it to a texture, and read it with tex2D. Texture reads are efficient both across rows and down columns. Just make sure your writes are coalesced.

Parallelize by row, but read 16 (or 32) columns at once so the reads are coalesced, then calculate the sums over these columns via a parallel reduction. You will have to use a reduction because otherwise you will not get enough parallelism: there is only enough shared memory to process fewer than 256 (or 128) rows at once.
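A rough C sketch of how this tiling could look, written as CPU code so the structure is visible; it is not a tuned CUDA kernel. Each row is processed in tiles of TILE columns (16 here, following the suggestion above). A tile is loaded as one contiguous chunk, scanned in doubling steps, and a carry holds the running total of the tiles to its left. The doubling scheme (often called a Hillis-Steele scan) is one possible choice and not necessarily what the poster had in mind. TILE, the function name, and the flat row-major layout are assumptions.

```c
#include <stddef.h>

#define TILE 16  /* columns loaded per step; assumed, from "16 (or 32)" */

/* r[j*W + i] = g[j*W + 0] + ... + g[j*W + i] for every row j. */
void row_prefix_sum_tiled(const float *g, float *r, size_t H, size_t W)
{
    for (size_t j = 0; j < H; j++) {          /* on a GPU: rows are independent */
        float carry = 0.0f;                   /* total of all tiles to the left */
        for (size_t base = 0; base < W; base += TILE) {
            size_t n = (W - base < TILE) ? (W - base) : TILE;
            float buf[TILE], tmp[TILE];       /* stands in for shared memory */

            for (size_t t = 0; t < n; t++)    /* contiguous load of one tile */
                buf[t] = g[j * W + base + t];

            /* Inclusive scan in doubling steps: after the step with offset d,
               buf[t] holds the sum of up to 2*d elements ending at t.
               Every t in a step is independent, so a step maps to n threads. */
            for (size_t d = 1; d < n; d *= 2) {
                for (size_t t = 0; t < n; t++)
                    tmp[t] = (t >= d) ? buf[t] + buf[t - d] : buf[t];
                for (size_t t = 0; t < n; t++)
                    buf[t] = tmp[t];
            }

            for (size_t t = 0; t < n; t++)    /* contiguous store, plus carry */
                r[j * W + base + t] = buf[t] + carry;
            carry += buf[n - 1];              /* tile total feeds the next tile */
        }
    }
}
```

The tiles within one row still run one after another because of the carry, but the rows and the elements inside each doubling step are independent, which is where the parallelism would come from.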

Thank you, I will try it.