Bugfix: __syncthreads() inside a for loop

Just in case anyone else makes the same mistake…

I have a for loop which 64 threads execute 51 times.
The loop iterations are pretty much independent,
so I thought why not have 128 threads each doing
half the work.

for(i=0+x;i<=50;i+=2) {//x=0 if threadIdx.x<64 else x=1

The iterations use shared memory and __syncthreads().
All was well for a bit, but under weird circumstances
odd answers relating to i=50 would be produced.

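BUT thinking about it, __syncthreads() must be called by all threads
in the block (even on the last loop iteration).
With x=0 the loop runs for i=0,2,...,50 (26 iterations),
but with x=1 only for i=1,3,...,49 (25 iterations),
so threads 64-127 leave the loop without making the 26th call.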

What I have done is

for(i0=0;i0<=50;i0+=2) {
i=i0+x; //x=0 if threadIdx.x<64 else 1

And then trapped i>50 (the work is skipped for that case,
but __syncthreads() is still called).
The new loop ensures, even on the 26th iteration (i0=50),
all 128 threads call __syncthreads().
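
For anyone who wants to see the fixed loop in context, here is a minimal sketch. It is not the actual kernel: the names split_loop, run_split_loop, buf and out, and the neighbour-exchange "work" done in each iteration, are all made up for illustration. The only part taken from the description above is the loop shape (i0 stepping by 2, i=i0+x, trap on i>50) with every __syncthreads() placed outside the trap.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kLanes    = 64;  // threads per half, as in the post
constexpr int kLastIter = 50;  // original loop covers i = 0..50 inclusive

// Hypothetical per-iteration work: each lane publishes a value in shared
// memory, then reads its neighbour's value after the barrier.
// out must hold (kLastIter + 1) * kLanes ints.
__global__ void split_loop(int *out)
{
    __shared__ int buf[2][kLanes];          // one row per half of the block
    const int x    = threadIdx.x / kLanes;  // 0 for threads 0..63, 1 for 64..127
    const int lane = threadIdx.x % kLanes;

    // Every thread runs the same 26 trips, so the barriers always match up.
    for (int i0 = 0; i0 <= kLastIter; i0 += 2) {
        const int  i      = i0 + x;
        const bool active = (i <= kLastIter);  // traps i > 50 (x == 1, i0 == 50)

        if (active) buf[x][lane] = i * kLanes + lane;
        __syncthreads();                       // reached by all 128 threads

        if (active) out[i * kLanes + lane] = buf[x][(lane + 1) % kLanes];
        __syncthreads();                       // buf is overwritten next trip
    }
}

// Launches one block of 128 threads and returns the number of wrong
// outputs, or -1 if a CUDA call fails.
int run_split_loop()
{
    const int n = (kLastIter + 1) * kLanes;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;

    split_loop<<<1, 2 * kLanes>>>(d_out);

    std::vector<int> h_out(n);
    const cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                       cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i <= kLastIter; ++i)
        for (int lane = 0; lane < kLanes; ++lane)
            if (h_out[i * kLanes + lane] != i * kLanes + (lane + 1) % kLanes)
                ++wrong;
    return wrong;
}
```

Calling run_split_loop() from main() should return 0 when every iteration, including i=50, produced the expected values. The point of the layout is that the trap guards only the work and never the barrier, so the loop's trip count no longer depends on x.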

The original loop (perhaps because it fitted onto warp boundaries???)
worked fine until an apparently unrelated change was made :-(

Perhaps there is a better way???

Any comments?
Anyone else had something similar?
Ideas for how to diagnose it? (Are there any tools to do it?)
Thanks again