Just in case anyone else makes the same mistake…
I have a for loop which 64 threads execute 51 times.
The loop iterations are pretty much independent,
so I thought why not have 128 threads each doing
half the work.
for(i=0+x;i<=50;i+=2) {//x=0 if threadIdx.x<64 else x=1
The iterations use shared memory and __syncthreads().
All was well for a bit but under weird circumstances
odd answers relating to i=50 would be produced.
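BUT I now think __syncthreads() must be called by all threads in the block
(even on the last loop iteration). With x=1 the loop stops at i=49,
so only the x=0 threads reach the __syncthreads() for i=50.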
What I have done is
for(i0=0;i0<=50;i0+=2) {
i=i0+x; //x=0 if threadIdx.x<64 else 1
And then trapped i>50 inside the loop body (without skipping any __syncthreads()).
The new loop ensures, even on the 26th iteration (i0=50),
all 128 threads call __syncthreads().
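In case it helps anyone, here is a minimal sketch of that loop shape as a complete program. It is not my actual kernel: the names splitLoop, partial and countMismatches, and the per-iteration work (summing lane+i over 64 lanes), are made up purely for illustration. Only the i0/x loop structure and the 128-thread split come from the description above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N_ITER 51   // i runs 0..50, as in the loop above
#define HALF   64   // threads in each half of the 128-thread block

// Hypothetical work: out[i] = sum over 64 lanes of (lane + i).
__global__ void splitLoop(int *out)
{
    __shared__ int partial[2][HALF];            // one row per half-block
    const int x    = (threadIdx.x < HALF) ? 0 : 1;
    const int lane = threadIdx.x % HALF;

    // Every one of the 128 threads runs exactly 26 trips of this loop.
    for (int i0 = 0; i0 <= 50; i0 += 2) {
        const int  i      = i0 + x;
        const bool active = (i <= 50);          // "trap" i>50 without leaving the loop

        partial[x][lane] = active ? (lane + i) : 0;
        __syncthreads();                        // reached by all threads, even when i0 == 50

        if (active && lane == 0) {
            int sum = 0;
            for (int k = 0; k < HALF; ++k) sum += partial[x][k];
            out[i] = sum;
        }
        __syncthreads();                        // partial[] is read before the next trip overwrites it
    }
}

// Host helper: runs the kernel once and counts wrong entries.
int countMismatches()
{
    int *d_out = nullptr;
    int h_out[N_ITER];
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemset(d_out, 0, sizeof(h_out));
    splitLoop<<<1, 2 * HALF>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    int bad = 0;
    for (int i = 0; i < N_ITER; ++i)
        if (h_out[i] != 2016 + HALF * i) ++bad; // 0+1+...+63 = 2016
    return bad;
}

int main()
{
    const int bad = countMismatches();
    printf("%d mismatches\n", bad);
    return bad != 0;
}
```

The point of the active flag is that the i>50 case only skips the work, never the barrier: both __syncthreads() calls sit outside any condition that differs between threads.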
The original loop (perhaps because it fitted onto warp boundaries???)
worked fine until an apparently unrelated change was made :-(
Perhaps there is a better way???
Bill