Hi all,
I am a newbie in CUDA, so maybe the question is very simple, but I can’t find a good solution.
I need to compute “integral” (running) sums over the columns of a 2D matrix.
for (int j = 0; j < H; j++) {
    r[j][0] = g[j][0];
}
so the 0-th column of r is equal to the 0-th column of g
for (int j = 0; j < H; j++)
    for (int i = 1; i < W; i++)
        r[j][i] = r[j][i-1] + g[j][i];
so the 1-st column of r is the sum of the 0-th and 1-st columns of g, the 2-nd column of r is the sum of columns 0, 1 and 2 of g, and so on.
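To make the definition unambiguous, here is a minimal sequential sketch in plain C of the two loops above combined into one function. It assumes the matrices are stored row-major in flat arrays of H*W floats; the function name row_running_sums and the flat layout are assumptions for illustration, not necessarily how the real code is organised.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential reference for the running sums described above:
 *   r[j][i] = g[j][0] + g[j][1] + ... + g[j][i]
 * ASSUMPTION: g and r are flat, row-major arrays of H*W floats,
 * so element (j, i) lives at index j*W + i. */
void row_running_sums(const float *g, float *r, size_t H, size_t W)
{
    assert(g != NULL && r != NULL);
    if (W == 0)
        return;                      /* nothing to sum */

    for (size_t j = 0; j < H; j++) {
        const float *g_row = g + j * W;
        float *r_row = r + j * W;

        r_row[0] = g_row[0];         /* 0-th column is copied as is */
        for (size_t i = 1; i < W; i++)
            r_row[i] = r_row[i - 1] + g_row[i];
    }
}
```

For a 2 x 3 input with rows (1, 2, 3) and (4, 5, 6), this gives rows (1, 3, 6) and (4, 9, 15). Each row is independent of the others, while the elements within a row depend on each other from left to right.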
When parallelizing “by row”, I get a lot of uncoalesced reads and writes, because the memory locations accessed by neighbouring threads are not consecutive (for coalescing, the i-th thread must access the i-th word). Parallelizing “by column” runs into the problem that the i-th column can be calculated only after the (i-1)-th column has been calculated.
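The access pattern behind the “by row” problem can be illustrated with a little index arithmetic in plain C. This is only a sketch: it assumes that “by row” means one thread per row j, that the matrix is a flat row-major array, and the helper names flat_index and by_row_stride_bytes are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Flat index of element (j, i) in a row-major matrix with W columns.
 * ASSUMPTION: row-major storage, element (j, i) at j*W + i. */
size_t flat_index(size_t j, size_t i, size_t W)
{
    assert(i < W);                   /* column must be inside the row */
    return j * W + i;
}

/* Distance in bytes between the words touched at the same step i
 * by the threads handling rows j and j+1 (one thread per row). */
size_t by_row_stride_bytes(size_t W)
{
    return (flat_index(1, 0, W) - flat_index(0, 0, W)) * sizeof(float);
}
```

With W = 5000 and 4-byte floats, neighbouring threads are 5000 words (about 20000 bytes) apart at every step, instead of the 1 word apart that coalescing needs.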
The matrices are large, approximately 5000*5000. r and g are currently float.
Can anybody help?
With best regards
Pavel Dvorkin