I am looking for a solution to a problem, recast in CUDA,
where a grid of data items, row by column, needs to be processed:
rows r from 0 to 1440-1 and columns c from 0 to 32-1.
F[r,c], r=0..1440-1, c=0..31
The job is to evaluate F for all 1440 * 32 items.
RULE
F[r,c] can be evaluated if
- The preceding row F[r-1,*] is done
- The preceding columns F[r,j], j<c, are done within the same row
- Assume row r=0 is done at the start.
In short, a given row can only be processed once the preceding row
has been processed, and within a row, an item can only be processed once
all the preceding columns in that row are known.
This, I believe, demands processing in lexicographical order by (r,c).
I can't see any way to leverage any parallelism.
Each column c has its own function f_c, which may be defined using any values from the
immediately preceding row and from the preceding columns within the given row.
If we have 32 columns, then we have 32 such functions.
maxcol = 32
mc=maxcol-1
maxrows =64
mr=maxrows-1
// assume row 0 is known: F[0,*] = {F[0,0], F[0,1], …, F[0,mc]}
F[1,0] = f_0(F[0,*])
F[1,1] = f_1(F[0,*], F[1,0])
F[1,2] = f_2(F[0,*], F[1,0], F[1,1])
F[1,3] = f_3(F[0,*], F[1,0], F[1,1], F[1,2])
…
F[1,mc] = f_mc(F[0,*], F[1,0], F[1,1], F[1,2], …, F[1,mc-1])
F[2,0] = f_0(F[1,*])
F[2,1] = f_1(F[1,*], F[2,0])
F[2,2] = f_2(F[1,*], F[2,0], F[2,1])
F[2,3] = f_3(F[1,*], F[2,0], F[2,1], F[2,2])
…
F[2,mc] = f_mc(F[1,*], F[2,0], F[2,1], F[2,2], …, F[2,mc-1])
…
F[r,0] = f_0(F[r-1,*])
F[r,1] = f_1(F[r-1,*], F[r,0])
F[r,2] = f_2(F[r-1,*], F[r,0], F[r,1])
F[r,3] = f_3(F[r-1,*], F[r,0], F[r,1], F[r,2])
…
F[r,mc] = f_mc(F[r-1,*], F[r,0], F[r,1], F[r,2], …, F[r,mc-1])
…
F[mr,0] = f_0(F[mr-1,*])
F[mr,1] = f_1(F[mr-1,*], F[mr,0])
F[mr,2] = f_2(F[mr-1,*], F[mr,0], F[mr,1])
F[mr,3] = f_3(F[mr-1,*], F[mr,0], F[mr,1], F[mr,2])
…
F[mr,mc] = f_mc(F[mr-1,*], F[mr,0], F[mr,1], F[mr,2], …, F[mr,mc-1])
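
To make the dependency pattern concrete, the recurrences above can be sketched as a plain sequential reference in host-side C++ (which also compiles under nvcc). This is only a sketch under assumptions: the actual f_c are not specified here, so `f` below is a made-up stand-in (a running average), and `Row` and `evaluate` are hypothetical names. The one thing it is meant to show is the access pattern: f_c reads all of F[r-1,*] and only F[r,j] for j<c.

```cpp
#include <array>
#include <cassert>
#include <vector>

constexpr int maxcol  = 32;  // columns per row, as in the pseudocode
constexpr int maxrows = 64;  // rows, as in the pseudocode

using Row = std::array<double, maxcol>;

// Hypothetical stand-in for f_c (the real functions are not given).
// Allowed inputs: every entry of prev = F[r-1,*], and cur[0..c-1] = F[r,j], j<c.
// Here: the average of prev[c] and the already-computed entries of the current row.
double f(int c, const Row& prev, const Row& cur) {
    assert(c >= 0 && c < maxcol);
    double acc = prev[c];
    for (int j = 0; j < c; ++j) {
        acc += cur[j];           // only columns j < c of the current row are read
    }
    return acc / (c + 1);
}

// Evaluate the whole grid in lexicographical (r,c) order, given row 0.
std::vector<Row> evaluate(const Row& row0) {
    std::vector<Row> F(maxrows);  // rows are zero-initialised
    F[0] = row0;                  // row 0 is assumed done at the start
    for (int r = 1; r < maxrows; ++r) {
        for (int c = 0; c < maxcol; ++c) {
            F[r][c] = f(c, F[r - 1], F[r]);
        }
    }
    return F;
}
```

Calling `evaluate(row0)` with a filled-in `Row` returns all `maxrows` rows. Passing the previous row and the partially filled current row as separate arguments keeps the two kinds of dependency visible, which may help when checking whether a particular set of f_c really needs every preceding column.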