lexagraphical order is there any way to use CUDA effectively

I am looking for a solution to a problem,recast in cuda.
where a grid of data items, row by colunm, needs to processed
rows r=0 to 1440-1 and columns c from 0 to 32-1
F[r,c] r=0,1440-1,c=0,31

The job is to do F of 1440 * 32 items.

RULE
F[r,c] can be evaluated if

  1. The preceeding row F[r-1,*] is done
  2. The preceeding columns [r,j] j<c done within the same row
  3. Assume r=0 is done at the start.

In short, a given row can only be processed if the preceeding row
has been processed, and within a row, an item can be processed if
all the preceeding columns in that row are known.

This, I believe, demands processing in lexagraphical order by (r,c).
I cant see any way to leverage any parallelism.
Each column c has a function f_c and may be defined using any values from the
immediately preceeeding row, and the preceeding columns within the given row.
If we have 32 columns, then we have 32 such functions.

maxcol = 32
mc=maxcol-1
maxrows =64
mr=maxrows-1

// asume {F[0,0], F[0,1] … F[0,mc]} = F[0,*]

F[1,0] =f_0( F[0,])
F[1,1] =f_1 (F[0,
] , F[1,0])
F[1,2] =f_2 (F[0,] , F[1,0], F[1,1])
F[1,3] =f_3 (F[0,
] , F[1,0], F[1,1], F[1,2])

F[1,mc] =f_mc(F[0,*] , F[1,0], F[1,1], F[1,2]…,F[1,mc-1])

F[2,0] =f_0( F[1,])
F[2,1] =f_1 (F[1,
) , F[2,0])
F[2,2] =f_2 (F[1,] , F[2,0], F[2,1])
F[2,3] =f_3 (F[1,
] , F[2,0], F[2,1], F[2,2])

F[2,mc] =f_mc(F[1,*] , F[2,0], F[2,1], F[2,2]…,F[2,mc-1])

F[r,0] =f_0( F[r-1,] )
F[r,1] =f_1 (F[r-1,
] , F[r,0])
F[r,2] =f_2 (F[r-1,] , F[r,0], F[r,1])
F[r,3] =f_3 (F[r-1,
] , F[r,0], F[r,1], F[r,2])

F[r,mc] =f_mc(F[r-1,*] , F[r,0], F[r,1], F[r,2]…,F[r,mc-1])

F[mr,0] =f_0( F[mr-1,])
F[mr,1] =f_1 (F[mr-1,
], F[mr,0])
F[mr,2] =f_2 (F[mr-1,], F[mr,0], F[mr,1])
F[mr,3] =f_3 (F[mr-1,
], F[mr,0], F[mr,1], F[mr,2])

F[mr,mc] =f_mc(F[mr-1,*] F[mr,0], F[mr,1], F[mr,2]…,F[mr,mc-1])

Since a value depends on all of the rows above it, and all the the values in the current row to the left of the current value, you’re just saying that you have a problem where x[i] depends on all values x[0] to x[i-1]. So they must be processed sequentially. You have no exploitable parallelism whatsoever.

This doesn’t mean it can’t be boosted for parallel computation, but it means you need to pry open your abstraction and look at the actual math… you may be able to process partial results without full information, then revisit the result to update it or combine contributions. (if you’re doing a sum table, for example).

Or process the compute from multiple blocks at once assuming default values for all unknowns, then after each block is done, seeing which finished blocks are invalid because the default assumption wasn’t true for its boundary. (This is common for dynamic programming problems which in theory have strict dependencies but in practice the dependencies are usually local, not global. String alignment edit distance computations come to mind.)

Your opaque black-box f_() functions don’t give any hint about possible strategies.