Inter-thread communication using CUDA

Hi guys, I have a big sequential for loop, and the only dependency is in the bottom part of the loop body.
For example:
for( )
{
    // independent computation here

    a[i] = a[i-1] op a[i];  // dependency
}

Before the dependency part, all steps are independent.
Now my question: if I want to implement this on the GPU, I would like all the independent steps to be executed in parallel. Then, the moment the first iteration is over, I want to pass the a[i-1] value on to the successive threads handling the successive iterations. I don't know if this is a good idea, and if it is, what is the best way to do it?

How many iterations does the for loop run?

What is the ratio/weight of the independent portion to the dependent portion?

How deep is a?

The loop count depends on the size of an image. Right now I am working with an image where the number of iterations is more than 200.

The ratio will be around 2:7.

You could likely still parallelize the problem.
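For what it's worth, here is a rough sketch of one way that could look. It assumes op is associative (plain addition is used as a stand-in); under that assumption a[i] = a[i-1] op a[i] is an inclusive prefix scan, which Thrust provides as thrust::inclusive_scan. The names independent_work, independent_kernel and run_loop_on_gpu are made up for illustration, and the float input array is an assumption as well. If your op is not associative, this does not apply as written.

```cuda
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>
#include <thrust/functional.h>

// Hypothetical stand-in for the independent per-iteration work.
__device__ float independent_work(float x) { return 2.0f * x + 1.0f; }

// Phase 1: one thread per loop iteration; no thread reads another's result.
__global__ void independent_kernel(const float* in, float* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = independent_work(in[i]);
}

// Runs both phases on the GPU and returns the final contents of a.
std::vector<float> run_loop_on_gpu(const std::vector<float>& input)
{
    int n = static_cast<int>(input.size());
    if (n == 0) return {};

    thrust::device_vector<float> d_in(input.begin(), input.end());
    thrust::device_vector<float> d_a(n);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    independent_kernel<<<blocks, threads>>>(
        thrust::raw_pointer_cast(d_in.data()),
        thrust::raw_pointer_cast(d_a.data()), n);

    // Phase 2: a[i] = a[i-1] op a[i], with op assumed to be +.
    // An in-place inclusive scan resolves the whole dependency chain.
    thrust::inclusive_scan(d_a.begin(), d_a.end(), d_a.begin(),
                           thrust::plus<float>());

    std::vector<float> out(n);
    thrust::copy(d_a.begin(), d_a.end(), out.begin());
    return out;
}
```

Calling run_loop_on_gpu from host code with the three inputs 1, 2, 3 gives intermediate values 3, 5, 7 and a final a of 3, 8, 15. No thread ever hands a value to another thread directly; the scan takes care of the chain.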

Given the weak ratio, one might simply forget about the independent section.

You could still assign a thread to each array element a[i].
The thread would then simply calculate both a[i] and a[i - 1] itself, since that is relatively cheap compared to the rest of the computation.
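A minimal sketch of that idea, under a loud assumption: the a[i-1] that thread i needs is the value produced by the independent step alone, so the thread can recompute it locally instead of waiting for its neighbour. The names independent_step, op, fused_kernel and run_fused are hypothetical, op is taken to be addition, and the input is assumed to be a float array. If a[i-1] must already include the results of all earlier iterations, this shortcut gives different answers from the sequential loop, and you would need a scan-style approach instead.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the independent work and for "op".
__device__ float independent_step(float x) { return 2.0f * x + 1.0f; }
__device__ float op(float left, float right) { return left + right; }

// One thread per element a[i]; the neighbour's value is recomputed,
// not communicated, so no inter-thread synchronisation is needed.
__global__ void fused_kernel(const float* in, float* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float cur = independent_step(in[i]);
    if (i == 0) { a[0] = cur; return; }  // first element has no predecessor

    float prev = independent_step(in[i - 1]);  // redundant but cheap
    a[i] = op(prev, cur);
}

// Host wrapper: copies the input over, launches the kernel, copies back.
std::vector<float> run_fused(const std::vector<float>& input)
{
    int n = static_cast<int>(input.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    float *d_in = nullptr, *d_a = nullptr;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMemcpy(d_in, input.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fused_kernel<<<blocks, threads>>>(d_in, d_a, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_a);
    return out;
}
```

With inputs 1, 2, 3 the independent values are 3, 5, 7, so run_fused returns 3, 8, 12. Error checking on the CUDA calls is left out to keep the sketch short.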