Tridiagonal solve synchronization

I’m working on a tridiagonal solver for large (m=1e6, n=32) systems and I’ve run into a significant problem.
The structure of cyclic reduction (CR) algorithms (and, I believe, PCR) for large tridiagonal systems is as follows:
We start with, say, 64 blocks.
After one step, only 32 blocks are active:

block 0 uses step 1 output from blocks 0 and 1
block 1 idles
block 2 uses step 1 output from blocks 2 and 3
block 3 idles
and so on. The point is that the blocks depend on each other: a block cannot start a step until its neighbours have finished the previous one, so without a barrier between steps there are race conditions. I am looking for a way to either sync all blocks, or a workaround that doesn’t need synchronization. The best solution I’ve come up with so far is to launch a new kernel for each step of execution, with a stream synchronization call in between, but this seems very inefficient to me.
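
For reference, here is a minimal sketch of the kernel-per-step approach, under several assumptions of mine: a single system of n equations stored in device arrays a (sub-diagonal), b (diagonal), c (super-diagonal) and d (right-hand side), with a[0] = 0 and c[n-1] = 0, and a matrix that is diagonally dominant so no pivoting is needed. The names crForward, crBackward and solveTridiagonalCR are made up for illustration. Kernels launched into the same stream run in launch order, so the launch boundary already acts as the grid-wide barrier and no host-side synchronization call should be needed between steps.

```cuda
#include <cuda_runtime.h>

// One forward-reduction level with stride s (hypothetical kernel name).
// Thread k updates equation i = 2*s*(k+1) - 1 in place, eliminating its
// coupling to neighbours i - s and i + s. Those neighbours are not written
// at this level, so there is no race inside the kernel.
__global__ void crForward(double* a, double* b, double* c, double* d,
                          int n, int s, int count)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= count) return;
    int i = 2 * s * (k + 1) - 1;
    int l = i - s, r = i + s;
    bool hasR = r < n;                       // right neighbour may not exist
    double k1 = a[i] / b[l];
    double k2 = hasR ? c[i] / b[r] : 0.0;
    b[i] -= c[l] * k1 + (hasR ? a[r] * k2 : 0.0);
    d[i] -= d[l] * k1 + (hasR ? d[r] * k2 : 0.0);
    a[i] = -a[l] * k1;                       // now couples to i - 2s
    c[i] = hasR ? -c[r] * k2 : 0.0;          // now couples to i + 2s
}

// One back-substitution level with stride s (hypothetical kernel name).
// Thread k solves equation j = s - 1 + 2*s*k from already-known x[j +/- s].
__global__ void crBackward(const double* a, const double* b, const double* c,
                           const double* d, double* x, int n, int s, int count)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= count) return;
    int j = s - 1 + 2 * s * k;
    double rhs = d[j];
    if (j - s >= 0) rhs -= a[j] * x[j - s];
    if (j + s < n)  rhs -= c[j] * x[j + s];
    x[j] = rhs / b[j];
}

// Host driver: one launch per level, all in the same stream. a, b, c, d are
// overwritten. The caller synchronizes once, after the last launch.
void solveTridiagonalCR(double* a, double* b, double* c, double* d,
                        double* x, int n, cudaStream_t stream)
{
    const int threads = 256;
    int s = 1;
    for (; 2 * s <= n; s *= 2) {             // forward reduction
        int count = n / (2 * s);
        int blocks = (count + threads - 1) / threads;
        crForward<<<blocks, threads, 0, stream>>>(a, b, c, d, n, s, count);
    }
    for (; s >= 1; s /= 2) {                 // back substitution
        int count = (n + s) / (2 * s);
        int blocks = (count + threads - 1) / threads;
        crBackward<<<blocks, threads, 0, stream>>>(a, b, c, d, x, n, s, count);
    }
}
```

The per-step cost here is the kernel launch itself rather than a round trip to the host. If even that is too much, a grid-wide barrier inside a single kernel is available through cooperative groups (grid.sync() with a cooperative launch), but that requires all blocks to be resident on the device at once, which may not hold for a system this large.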

Also a small aside: I’ve posted a consultation request regarding this exact problem (and a few others) on the jobs page, and the offer still stands to anyone who wants to lend a hand and earn some spare change ;)