I am looking for a solution to a problem, recast in CUDA,

where a grid of data items, rows by columns, needs to be processed.

The rows run r = 0 to 1440-1 and the columns c = 0 to 32-1:

F[r,c], r = 0..1439, c = 0..31

The job is to evaluate F for all 1440 * 32 items.

RULE

F[r,c] can be evaluated if

- The preceding row F[r-1,*] is done
- The preceding columns F[r,j], j<c, are done within the same row
- Assume r=0 is done at the start.

In short, a given row can only be processed once the preceding row has been processed, and within a row, an item can be processed once all the preceding columns in that row are known.

This, I believe, demands processing in lexicographical order by (r,c).

I can't see any way to leverage any parallelism.
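
The rule can be checked mechanically. The following is a minimal sketch in plain host-side C++ (not a CUDA kernel, though nvcc accepts it unchanged). The names `DoneGrid`, `is_ready`, `count_ready` and `max_simultaneously_ready` are hypothetical, introduced only for illustration. It walks a small grid in (r,c) order and records how many items are ready at each step under the rule exactly as stated above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical bookkeeping: done[r][c] is true once F[r,c] has been evaluated.
using DoneGrid = std::vector<std::vector<bool>>;

// The RULE as stated: F[r,c] is ready when all of row r-1 is done and
// every F[r,j] with j < c is done. Row 0 is assumed done, so r starts at 1.
bool is_ready(const DoneGrid& done, int r, int c) {
    assert(r >= 1 && r < static_cast<int>(done.size()));
    for (bool d : done[r - 1]) if (!d) return false;
    for (int j = 0; j < c; ++j) if (!done[r][j]) return false;
    return true;
}

// Counts the items that are ready but not yet evaluated.
int count_ready(const DoneGrid& done) {
    int n = 0;
    for (int r = 1; r < static_cast<int>(done.size()); ++r)
        for (int c = 0; c < static_cast<int>(done[r].size()); ++c)
            if (!done[r][c] && is_ready(done, r, c)) ++n;
    return n;
}

// Walks the grid in (r,c) order and returns the largest number of items
// that were ready at the same time at any step.
int max_simultaneously_ready(int rows, int cols) {
    DoneGrid done(rows, std::vector<bool>(cols, false));
    done[0].assign(cols, true);  // row 0 is given
    int worst = 0;
    for (int r = 1; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            int n = count_ready(done);
            if (n > worst) worst = n;
            done[r][c] = true;   // evaluate F[r,c]
        }
    return worst;
}
```

For a small grid, for example `max_simultaneously_ready(4, 5)`, the result is 1: at every step only a single item satisfies the rule. This brute-force check is far too slow for the full 1440 by 32 grid and is only meant to illustrate the dependency structure.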

Each column c has a function f_c, which may be defined using any values from the immediately preceding row and the preceding columns within the given row.

If we have 32 columns, then we have 32 such functions.
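
One way to picture the per-column functions is as a table with one entry per column. The sketch below is plain host-side C++ under loud assumptions: items are taken to be `double`, and `ColFn`, `f_sum_prev`, `f_prev_plus_left` and `eval_row` are hypothetical names. The two sample functions are placeholders, not the real f_c.

```cpp
#include <cassert>
#include <vector>

// Assumption: items are doubles. prev is F[r-1,*] (all columns);
// cur holds F[r,0..c-1], the columns already computed in the current row.
using ColFn = double (*)(const std::vector<double>& prev,
                         const std::vector<double>& cur);

// Placeholder for a column-0 function: uses the whole previous row.
double f_sum_prev(const std::vector<double>& prev, const std::vector<double>&) {
    double s = 0.0;
    for (double v : prev) s += v;
    return s;
}

// Placeholder for a later column: cur.size() equals the column index c
// being computed, so this reads F[r-1,c] and F[r,c-1].
double f_prev_plus_left(const std::vector<double>& prev,
                        const std::vector<double>& cur) {
    return prev[cur.size()] + cur.back();
}

// Evaluates one row: column c calls fns[c] with the previous row
// and the part of the current row computed so far.
std::vector<double> eval_row(const std::vector<ColFn>& fns,
                             const std::vector<double>& prev) {
    assert(fns.size() == prev.size());
    std::vector<double> cur;
    cur.reserve(fns.size());
    for (ColFn f : fns) cur.push_back(f(prev, cur));
    return cur;
}
```

With 32 columns the table `fns` would hold 32 entries, one per f_c. Passing the partially filled row as `cur` makes it impossible for f_c to read a column that has not been computed yet.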

maxcol = 32

mc=maxcol-1

maxrows =64

mr=maxrows-1

// assume {F[0,0], F[0,1] … F[0,mc]} = F[0,*]

F[1,0] =f_0( F[0,*])
F[1,1] =f_1 (F[0,*] , F[1,0])

F[1,2] =f_2 (F[0,*] , F[1,0], F[1,1])

F[1,3] =f_3 (F[0,*] , F[1,0], F[1,1], F[1,2])

…

F[1,mc] =f_mc(F[0,*] , F[1,0], F[1,1], F[1,2]…,F[1,mc-1])

F[2,0] =f_0( F[1,*])
F[2,1] =f_1 (F[1,*] , F[2,0])

F[2,2] =f_2 (F[1,*] , F[2,0], F[2,1])

F[2,3] =f_3 (F[1,*] , F[2,0], F[2,1], F[2,2])

…

F[2,mc] =f_mc(F[1,*] , F[2,0], F[2,1], F[2,2]…,F[2,mc-1])

…

F[r,0] =f_0( F[r-1,*] )
F[r,1] =f_1 (F[r-1,*] , F[r,0])

F[r,2] =f_2 (F[r-1,*] , F[r,0], F[r,1])

F[r,3] =f_3 (F[r-1,*] , F[r,0], F[r,1], F[r,2])

…

F[r,mc] =f_mc(F[r-1,*] , F[r,0], F[r,1], F[r,2]…,F[r,mc-1])

…

F[mr,0] =f_0( F[mr-1,*])
F[mr,1] =f_1 (F[mr-1,*], F[mr,0])

F[mr,2] =f_2 (F[mr-1,*], F[mr,0], F[mr,1])

F[mr,3] =f_3 (F[mr-1,*], F[mr,0], F[mr,1], F[mr,2])

…

F[mr,mc] =f_mc(F[mr-1,*], F[mr,0], F[mr,1], F[mr,2]…,F[mr,mc-1])
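
The whole listing collapses into two nested loops. This is a sequential reference sketch in plain host-side C++ (not a CUDA kernel), resting on assumptions: items are `double`, the grid is stored row-major in a flat array, and `ColumnFn`, `sweep` and `make_demo_f` are hypothetical names. The demo function is a placeholder, not the real f_c.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Assumption: items are doubles, stored row-major as F[r * maxcol + c].
// f(c, prev, cur) stands in for f_c: prev points at F[r-1,0], cur at F[r,0],
// and only cur[0..c-1] may be read.
using ColumnFn = std::function<double(int c, const double* prev, const double* cur)>;

// Fills rows 1..mr in (r,c) order; row 0 must already be filled in.
void sweep(std::vector<double>& F, int maxrows, int maxcol, const ColumnFn& f) {
    assert(static_cast<int>(F.size()) == maxrows * maxcol);
    const int mr = maxrows - 1;
    const int mc = maxcol - 1;
    for (int r = 1; r <= mr; ++r) {
        const double* prev = &F[(r - 1) * maxcol];
        double* cur = &F[r * maxcol];
        for (int c = 0; c <= mc; ++c)
            cur[c] = f(c, prev, cur);  // F[r,c] = f_c(F[r-1,*], F[r,0..c-1])
    }
}

// Placeholder rule, NOT the real f_c: column 0 uses the last item of the
// previous row; other columns add the left neighbour and the item above.
ColumnFn make_demo_f(int maxcol) {
    return [maxcol](int c, const double* prev, const double* cur) {
        return c == 0 ? prev[maxcol - 1] + 1.0 : cur[c - 1] + prev[c];
    };
}
```

For a 3 by 2 grid whose row 0 is {1, 2}, `sweep` with `make_demo_f(2)` fills row 1 with {3, 5} and row 2 with {6, 11}. The placeholder makes the chain explicit: F[r,0] reads F[r-1,mc], so a row cannot start before the previous row has finished.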