Not sure where to start for problem that is parallel in all but one dimension

The problem is to calculate a recursive function in parallel. I’ve been playing around with Thrust, and it seems like it could almost do what I want if I could figure out how to combine a few of the examples, but I’m having trouble doing that. I’m not sure if Thrust will really be able to do what I want, or if I should try another library.

I want to calculate something like

x(t+1) = a(t+1) + b(t+1)*x(t)

for t=0 to n, and for many independent “people,” which could really be over several dimensions, but the important aspect is that each recursive formula is in it’s own “world” and could be computed in parallel. a() and b() could also be in parallel, or could be considered given. In addition, x(0) for each person could be considered given.

The problem I’m running into with Thrust is that it seems to have the ability to iterate over tuples with thrust::for_each, and has the ability to calculate cumulative sums with thrust::transform_inclusive_scan, but I don’t know how I can do both at the same time.

I’d like a solution where each thread could be given a number of vector arguments, and then perform a recursive calculation. Can I do that with Thrust, or would some other library be better? I considered doing each time step in Thrust and then iterating through time, but this poses an unnecessary restriction that each time step must be completed before the next one can start.

Another way to say it is that the all-prefix-sum operation, like it is described here:

http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

is changed from

new = [something] + [old]

to

new = [something] + [something else]*[old]

Which seems like a trivial change, so that’s why I’m surprised that I haven’t found an easy way to do this.

i would attempt to write x(n) as the difference of its terms:

x(1) = a(1) + b(1)x(0)
x(2) = a(2) + b(2)x(1)
x(3) = a(3) + b(3)x(2)

then:

x(2) = a(2) + b(2)a(1) + b(2)b(1)x(0)
x(3) = a(3) + b(3)a(2) + b(3)b(2)a(1) + b(3)b(2)b(1)x(0)

if you then build a product scan - similar to a sum scan, but you multiply instead of summing - of b(n), you should be left with something rather parallel
just check
perhaps just write out x(4) and x(5) to truly see how x(n) progresses, and what you thus would require

Couldn’t you just do it in two steps?

Something like

[old] = [something else] * [old]

new = [something] + [old]

Each of those steps could be trivially parallelised, maybe you should be using the CUBLAS libraries?