i have no idea what ‘pipe-lining it’ means

the size is somewhat small, nevertheless:

assume that, with your kernel dimensions, you manage to cover a row (all columns), and a number of rows

so, if you take the matrix and divide it into a number of sub-matrices - each a multiple of rows - you can start kernels working on the sub-matrices, one row-multiple after the other, sequentially

but this is only if you iterate

assume you have a matrix of 75 rows; assume a kernel covers 5 rows

you can parallel expose the matrix with 15 kernels (not that parallel exposing - ‘covering’ - the matrix with only 1 kernel is not possible)

with iterations, you can exploit the fact that a) each iteration consists of row operations, and b) the operation on row n of the next iteration can start once the operation on row n + 1 of the current iteration is complete

this way, you would be able to run row operations from different iterations in parallel, instead of only the row operations within a single iteration

you would require streams to achieve and implement both the required concurrency and synchronization
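a minimal sketch of how streams and events could express that dependency, under assumptions of my own: the row operation (an average of rows n - 1, n, n + 1), the 64 columns, the 4 iterations, the ping-pong buffers and all names (row_op, done, ROWS_PER_KERNEL) are hypothetical, and error checking is left out; each row block gets its own stream, and before a block starts its next iteration, its stream waits on the events its two neighbours recorded for the current iteration

```cuda
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// hypothetical sizes: 75 rows in 15 blocks of 5 rows, 64 columns, 4 iterations
constexpr int ROWS = 75, COLS = 64, ROWS_PER_KERNEL = 5, ITERS = 4;
constexpr int NBLOCKS = ROWS / ROWS_PER_KERNEL;

// assumed row operation: average of rows n - 1, n, n + 1 (clamped at the edges)
__global__ void row_op(const float* src, float* dst, int row0) {
    int r = row0 + blockIdx.x;  // one thread block per row
    int up = max(r - 1, 0), dn = min(r + 1, ROWS - 1);
    for (int c = threadIdx.x; c < COLS; c += blockDim.x)
        dst[r * COLS + c] = (src[up * COLS + c] + src[r * COLS + c] + src[dn * COLS + c]) / 3.0f;
}

int main() {
    const size_t bytes = sizeof(float) * ROWS * COLS;
    std::vector<float> h(ROWS * COLS), ref, tmp(ROWS * COLS);
    for (int i = 0; i < ROWS * COLS; ++i) h[i] = float(i % 17);

    // host reference: the same iterations, done strictly one after the other
    ref = h;
    for (int it = 0; it < ITERS; ++it) {
        for (int r = 0; r < ROWS; ++r) {
            int up = std::max(r - 1, 0), dn = std::min(r + 1, ROWS - 1);
            for (int c = 0; c < COLS; ++c)
                tmp[r * COLS + c] = (ref[up * COLS + c] + ref[r * COLS + c] + ref[dn * COLS + c]) / 3.0f;
        }
        ref.swap(tmp);
    }

    float *src, *dst;  // ping-pong buffers: read src, write dst, then swap
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);
    cudaMemcpy(src, h.data(), bytes, cudaMemcpyHostToDevice);

    cudaStream_t s[NBLOCKS];
    cudaEvent_t done[NBLOCKS];
    for (int k = 0; k < NBLOCKS; ++k) {
        cudaStreamCreate(&s[k]);
        cudaEventCreateWithFlags(&done[k], cudaEventDisableTiming);
    }

    for (int it = 0; it < ITERS; ++it) {
        // block k of this iteration waits for blocks k - 1 and k + 1 of the
        // previous iteration; its own previous iteration is ordered by the stream
        if (it > 0)
            for (int k = 0; k < NBLOCKS; ++k) {
                if (k > 0) cudaStreamWaitEvent(s[k], done[k - 1], 0);
                if (k + 1 < NBLOCKS) cudaStreamWaitEvent(s[k], done[k + 1], 0);
            }
        // all waits are issued before any event is re-recorded for this iteration
        for (int k = 0; k < NBLOCKS; ++k) {
            row_op<<<ROWS_PER_KERNEL, 64, 0, s[k]>>>(src, dst, k * ROWS_PER_KERNEL);
            cudaEventRecord(done[k], s[k]);
        }
        std::swap(src, dst);  // the latest result is now in src
    }
    cudaDeviceSynchronize();
    cudaMemcpy(h.data(), src, bytes, cudaMemcpyDeviceToHost);

    float max_err = 0.0f;
    for (int i = 0; i < ROWS * COLS; ++i) max_err = std::max(max_err, std::fabs(h[i] - ref[i]));
    std::printf("max abs error vs host reference: %g\n", max_err);

    for (int k = 0; k < NBLOCKS; ++k) {
        cudaStreamDestroy(s[k]);
        cudaEventDestroy(done[k]);
    }
    cudaFree(src);
    cudaFree(dst);
    return max_err < 1e-5f ? 0 : 1;
}
```

nothing on the host blocks between iterations here; a block in the middle of the matrix can already be in its next iteration while blocks further away are still finishing the current one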

below is an elementary demonstration of a 10 row matrix, with a row multiple of 5, with 2 iterations issued in 2 streams - S1, S2

it assumes that the row dependency is simply n, not n - 1, n + 1 as in your case

kernel(iteration 1: rows 0 - 4) S1

kernel(iteration 1: rows 5 - 9) S2

kernel(iteration 2: rows 0 - 4) S1

kernel(iteration 2: rows 5 - 9) S2

…
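the issue order above could be written roughly as follows; this is only a sketch, with a made-up in-place row operation (x = 2x + 1) that depends on row n alone, 32 columns chosen arbitrarily, hypothetical names, and no error checking; because each row block always goes into the same stream, stream order by itself makes iteration 2 of a block wait for iteration 1 of that block, and no events are needed

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// 10 rows, 5 rows per kernel, 2 iterations as in the demonstration; COLS is arbitrary
constexpr int ROWS = 10, COLS = 32, ROWS_PER_KERNEL = 5, ITERS = 2;
constexpr int NSTREAMS = ROWS / ROWS_PER_KERNEL;  // S1 and S2

// assumed row operation that depends only on row n itself: x = 2x + 1, in place
__global__ void row_op(float* m, int row0) {
    int r = row0 + blockIdx.x;  // one thread block per row
    int c = threadIdx.x;        // one thread per column
    if (c < COLS) m[r * COLS + c] = 2.0f * m[r * COLS + c] + 1.0f;
}

int main() {
    const size_t bytes = sizeof(float) * ROWS * COLS;
    std::vector<float> in(ROWS * COLS), out(ROWS * COLS);
    for (int i = 0; i < ROWS * COLS; ++i) in[i] = float(i % 7);

    float* d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, in.data(), bytes, cudaMemcpyHostToDevice);

    cudaStream_t s[NSTREAMS];
    for (int k = 0; k < NSTREAMS; ++k) cudaStreamCreate(&s[k]);

    // issue order: (iteration 1, rows 0 - 4, S1), (iteration 1, rows 5 - 9, S2),
    //              (iteration 2, rows 0 - 4, S1), (iteration 2, rows 5 - 9, S2)
    for (int it = 0; it < ITERS; ++it)
        for (int k = 0; k < NSTREAMS; ++k)
            row_op<<<ROWS_PER_KERNEL, COLS, 0, s[k]>>>(d, k * ROWS_PER_KERNEL);

    cudaDeviceSynchronize();
    cudaMemcpy(out.data(), d, bytes, cudaMemcpyDeviceToHost);

    // two applications of x = 2x + 1 give 4x + 3
    bool ok = true;
    for (int i = 0; i < ROWS * COLS; ++i) ok = ok && (out[i] == 4.0f * in[i] + 3.0f);
    std::printf("%s\n", ok ? "results match" : "results differ");

    for (int k = 0; k < NSTREAMS; ++k) cudaStreamDestroy(s[k]);
    cudaFree(d);
    return ok ? 0 : 1;
}
```

with an n - 1, n + 1 dependency the two streams would additionally have to wait on each other, for instance with cudaEventRecord and cudaStreamWaitEvent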