Thread control: looking for help

I want to generate the following 4*4 matrix:
{3,5,7,9,
9,21,37,47,
31,89,173,221,
121,383,777,999
}
The C code that generates the matrix is as follows:
for (i = 0; i < 4; i++)        // row
{
    for (j = 0; j < 4; j++)    // column
    {
        if (j == 0)
            k = buf[j] + buf[j+1] + 1;
        else
            k = buf[j] + buf[j+1] + buf[j-1];

        buf[j] = k;
        d1[i*dx + j] = k;
    }
}
Initially, buf[0] through buf[4] are all set to 1.
Problem: Because this computation is recursive (each element depends on values computed earlier), I let one row compute its first 2 columns before the next row starts working. I create 4 threads, and each thread handles one row.
The CUDA code is:
int tid = threadIdx.x;   // 1 dimension, 4 threads
int k = -1;              // k is the column number this thread is working on
for (i = 0; i < (3+2); i++)
{
    if (tid == i) active = 1;   // use active to control when a thread starts
    __syncthreads();
    for (j = 0; j < 2; j++)
    {
        __syncthreads();
        if (k > 3) active = 0;  // if the column number of a thread exceeds 3, the thread is deactivated
        if (active == 1)
        {
            k++;
            if (k == 0)
                result = buf_d[k] + buf_d[k+1] + 1;
            else
                result = buf_d[k] + buf_d[k+1] + buf_d[k-1];

            buf_d[k] = result;
        }
        __syncthreads();
    }
}

Only the data of the first row is correct. I don't know why.
3dfd.cu (1.67 KB)