Hi All

I am new to CUDA, so maybe the question is very simple, but I can’t find a good solution.

I need to compute “integral” (cumulative) sums over the columns of a 2D matrix, i.e. a running sum along each row.

```
for (int j = 0; j < H; j++) {
    r[j][0] = g[j][0];
}
```

So the 0th column of r is equal to the 0th column of g.

```
for (int j = 0; j < H; j++)
    for (int i = 1; i < W; i++)
        r[j][i] = r[j][i-1] + g[j][i];
```

So the 1st column of r is the sum of the 0th and 1st columns of g, the 2nd column of r is the sum of columns 0, 1 and 2 of g, and so on.
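For reference, the two loops above can be combined into one CPU routine. This is only a sketch under my own assumptions: the matrices are stored as flat row-major arrays (element (j, i) at offset j*W + i), and the function name `row_prefix_sum` is made up for illustration.

```c
#include <stddef.h>

/* Inclusive running sum along each row of an H x W matrix.
   Assumption: g and r are flat row-major arrays of H*W floats,
   so element (j, i) lives at offset j*W + i. */
void row_prefix_sum(const float *g, float *r, int H, int W)
{
    for (int j = 0; j < H; j++) {
        const float *gj = g + (size_t)j * W;  /* start of row j in g */
        float *rj = r + (size_t)j * W;        /* start of row j in r */
        float acc = 0.0f;                     /* running sum for this row */
        for (int i = 0; i < W; i++) {
            acc += gj[i];
            rj[i] = acc;                      /* r[j][i] = g[j][0] + ... + g[j][i] */
        }
    }
}
```

Something like this could serve as the baseline to check any GPU version against.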

When parallelizing “by row” I get a lot of uncoalesced reads and writes, because the memory locations accessed by neighbouring threads are not consecutive (for coalescing, the i-th thread must access the i-th word). Trying to parallelize “by column” runs into the problem that the i-th column can be calculated only after the (i-1)-th column has been calculated.
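To make the access pattern concrete, here is a small sketch of which flat offset each thread would touch under the two mappings. It assumes row-major storage, and the helper names are hypothetical, just for illustration.

```c
#include <stddef.h>

/* "By row": thread t owns row t and, at loop step s, touches element (t, s).
   Assumption: row-major layout, so the offset in floats is t*W + s. */
size_t offset_thread_per_row(int t, int s, int W)
{
    return (size_t)t * W + s;
}

/* "By column": thread t owns column t and, at step s, touches element (s, t).
   Neighbouring threads then touch neighbouring words, but element (s, t)
   depends on element (s, t-1), which belongs to thread t-1. */
size_t offset_thread_per_column(int t, int s, int W)
{
    return (size_t)s * W + t;
}
```

With W = 5000, neighbouring threads in the “by row” mapping are 5000 floats apart at every step, while in the “by column” mapping they are 1 float apart.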

The matrices are large, approximately 5000*5000. r and g are currently float.

Can anybody help?

With best regards

Pavel Dvorkin