problem passing big loop to cuda


I have a huge loop like the following (simplified for brevity)

for (i = 1; i < …; i++)
    for (k = 1; k < …; k++) {
        b[i][k] = b[i-1][k] + some stuff
    }

So basically, in this loop row [i] depends on row [i-1] of the shared array b, which was assigned in the previous iteration.

I have already written the code in CUDA form.

The problem is that I do not know how to ensure that [i-1] will always be calculated before [i] for the same k, without running into deadlocks…

__syncthreads() won't work because I am using many blocks, not just threads within a single block.
Please help.

This looks like a cumulative sum, so check out the scan algorithm to maybe get some ideas on how to do this.
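For reference, the scan idea in its simplest form can be sketched with Thrust's `inclusive_scan` (this computes a plain prefix sum over one column; it only applies directly if "some stuff" can be rephrased as an associative operator — the sizes here are made up for illustration):

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main()
{
    // Treat one column b[.][k] as a 1-D example: out[i] = out[i-1] + in[i].
    thrust::device_vector<float> col(1024, 1.0f);

    // In-place inclusive prefix sum over the column; the library handles
    // the cross-block dependency internally.
    thrust::inclusive_scan(col.begin(), col.end(), col.begin());

    // col[i] now holds the sum of elements 0..i.
    return 0;
}
```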

No, actually it is not exactly a sum.

I wrote “b[i][k] = b[i-1][k] + some stuff” just to show that the i-th element depends on the (i-1)-th element… it is not exactly a “sum” that the algorithm does.

Well, then you can keep the loop over i as a simple solution.

What is possible is the following: split k over your blocks, so that all i's belonging to the same k stay within one block. Then, with your N threads within the block, calculate the first N values of a row, then the next N, and so on until the whole row for the current i is done. At that point you can use __syncthreads() to make sure all N values have been calculated before moving on to i+1.
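A minimal sketch of that idea, assuming the dependency is only on row i-1 (the dimensions NI/NK and the placeholder some_stuff() are invented for illustration; replace them with your real sizes and per-element work):

```cuda
#define NI 1024   // number of rows (i), hypothetical
#define NK 4096   // number of columns (k), hypothetical

// Placeholder for whatever "some stuff" really is.
__device__ float some_stuff(int i, int k);

__global__ void rowByRow(float *b /* NI x NK, row-major */)
{
    // Each thread owns a fixed set of k's (grid-stride over k),
    // so a given k is always handled by the same thread.
    int k0     = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // The loop over i stays serial inside the kernel: row i is only
    // computed after row i-1 is finished by this thread's k's.
    for (int i = 1; i < NI; ++i) {
        for (int k = k0; k < NK; k += stride)
            b[i * NK + k] = b[(i - 1) * NK + k] + some_stuff(i, k);

        // Only needed if some_stuff() also reads neighbouring k's
        // written by other threads of this block. Other blocks never
        // touch this block's k's, so no cross-block sync is required.
        __syncthreads();
    }
}
```

The key point is that the serial dependency runs along i only, so each column k can proceed independently; the single kernel launch keeps the i loop on the device and avoids one launch per row.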