I wrote “b[i][k] = b[i-1][k] + some stuff” just to show that the ith element depends on the (i-1)th element… what the algorithm does is not exactly a “sum”.
Well, then you can keep the loop over i as a simple solution.
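Keeping the loop over i could look roughly like the sketch below. Since your real update rule isn't shown, `some_stuff` is a made-up placeholder, and the row-major layout `b[i * cols + k]`, the names `recurrence_kernel` and `run_and_count_mismatches`, and the sizes in `main` are all my assumptions. Each thread takes one k and walks through i in order, so the dependency on i-1 is respected without any synchronization.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the "some stuff" term; the real update is not known here.
__host__ __device__ inline float some_stuff(int i, int k) { return 0.5f * i + k; }

// One thread per k. The loop over i stays sequential inside the thread,
// because b[i][k] needs b[i-1][k]. Assumed layout: b[i * cols + k].
__global__ void recurrence_kernel(float* b, int rows, int cols)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= cols) return;
    for (int i = 1; i < rows; ++i) {
        b[i * cols + k] = b[(i - 1) * cols + k] + some_stuff(i, k);
    }
}

// Runs the kernel, recomputes the same recurrence on the CPU, and returns
// the number of differing elements (or -1 if a CUDA call failed).
int run_and_count_mismatches(int rows, int cols)
{
    std::vector<float> h_b((size_t)rows * cols, 0.0f);
    for (int k = 0; k < cols; ++k) h_b[k] = 1.0f;   // arbitrary values for row i = 0
    std::vector<float> ref(h_b);

    float* d_b = nullptr;
    size_t bytes = h_b.size() * sizeof(float);
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) return -1;
    if (cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(d_b);
        return -1;
    }

    int threads = 128;
    int blocks = (cols + threads - 1) / threads;
    recurrence_kernel<<<blocks, threads>>>(d_b, rows, cols);

    bool ok = (cudaGetLastError() == cudaSuccess);
    // This copy also waits for the kernel to finish.
    ok = ok && (cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess);
    cudaFree(d_b);
    if (!ok) return -1;

    // CPU reference using the same placeholder update.
    for (int i = 1; i < rows; ++i)
        for (int k = 0; k < cols; ++k)
            ref[i * cols + k] = ref[(i - 1) * cols + k] + some_stuff(i, k);

    int mismatches = 0;
    for (size_t n = 0; n < h_b.size(); ++n)
        if (std::fabs(h_b[n] - ref[n]) > 1e-3f) ++mismatches;
    return mismatches;
}

int main()
{
    int mismatches = run_and_count_mismatches(64, 1000);
    std::printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

With this layout, neighbouring threads read and write neighbouring addresses for each i, so the accesses should coalesce. If your array is stored as `b[k * rows + i]` instead, the same kernel still works after swapping the index expression, but the memory access pattern will be less favourable.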
What is possible is the following: split k over your blocks, so that all i’s belonging to the same k stay within one block. Then, with the N threads in the block, calculate the first N, then the second N, and so on until you have calculated all i’s. Between batches you can use __syncthreads() to make sure all N values have been calculated before moving on.