__syncthreads() not syncing all threads in my thread block

I am stumped: about 3% of the time, __syncthreads() appears to fail to synchronize my thread block. Here is some cut-down code and output to demonstrate:

__device__ void d_find_inflection_points(const short nSamples, const short* adc,
                                         const short dfno_sm, const bool fwd, 
                                         const device_memory_toc_t* myMTOC,
                                         short* sole, short* pale) {

    const byte_t ch = threadIdx.y;
    ...
    __shared__ int last_baseline_idx[ch_cnt];
    ...
    if (threadIdx.x == 0) {
        if (fwd) {
            last_baseline_idx[ch] = 0;
            ...
        } else {
            last_baseline_idx[ch] = nSamples;
            ...
        }
    }
    __syncthreads();
    int local_last_baseline_idx = last_baseline_idx[ch];
    ...

    ... Thread local logic to assign local_last_baseline_idx ...
	
    if (adc[local_last_baseline_idx] < baseline) {
        if (fwd) {
            atomicMax(&last_baseline_idx[ch], local_last_baseline_idx);
        } else {
            atomicMin(&last_baseline_idx[ch], local_last_baseline_idx);
        }
    }
    __syncthreads();
    if (fwd && (threadIdx.x == 50 || threadIdx.x == 30)) {
        printf("%04d/%1d/%02d: EOFE: adc[%04d] = %04d < %04.3f sm: %d; i_limit: %d; blockIdx: %d, %d \n", 
            myMTOC->img_results->id, ch, threadIdx.x, last_baseline_idx[ch], adc[last_baseline_idx[ch]], baseline,
            dfno_sm, i_limit, blockIdx.x, blockIdx.y
            );
    }
....
}

In about 3% of my results I see an error like this:

0502/1/30: EOFE: adc[0030] = 1476 < 1484.188 sm: 58; i_limit: 1; blockIdx: 502, 0 
0502/1/50: EOFE: adc[0057] = 1248 < 1484.188 sm: 58; i_limit: 1; blockIdx: 502, 0 

[d_find_inflection_points() is called twice, first with fwd = true and then a second time with fwd = false. I am only focusing on the fwd = true case right now.]

Note that thread 30 reads last_baseline_idx[ch] as 30, while thread 50 reads it as 57. But last_baseline_idx[ch] is in shared memory, and I called __syncthreads() just before reading it, so how can the two threads see different values? The blockIdx fields at the end of each output line confirm that both lines were written by the same thread block.

I will also note that when I run this multiple times with the same data set, the problem moves to a different blockIdx each time. So this is not bad data; it seems to depend on execution order.
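The expectation described above can be checked in isolation. The following is only a minimal sketch, not the original function: the kernel name, the constants kChannels and kThreadsX, and the helper count_mismatches are all hypothetical, and the thread-local logic is replaced by each thread simply proposing its own threadIdx.x. It initializes a shared slot per channel, reduces into it with atomicMax, and counts any thread that reads something other than the maximum after the second barrier.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kChannels = 4;   // hypothetical stand-in for ch_cnt
constexpr int kThreadsX = 64;  // hypothetical number of threads per channel

// Each channel (threadIdx.y) reduces a per-thread candidate into shared memory
// with atomicMax; every thread then reads the result after a barrier.
__global__ void shared_max_check(int* mismatches) {
    __shared__ int last_idx[kChannels];
    const int ch = threadIdx.y;

    if (threadIdx.x == 0) {
        last_idx[ch] = 0;  // one thread per channel initializes the slot
    }
    __syncthreads();       // initialization is visible to the whole block

    // Stand-in for the thread-local logic: each thread proposes its own index.
    atomicMax(&last_idx[ch], static_cast<int>(threadIdx.x));
    __syncthreads();       // all atomics have completed before anyone reads

    // Every thread of the channel should now read the same maximum.
    if (last_idx[ch] != kThreadsX - 1) {
        atomicAdd(mismatches, 1);
    }
}

// Launches the kernel and returns how many threads read an unexpected value,
// or -1 if a CUDA call failed (for example, when no device is available).
int count_mismatches(int blocks) {
    int* d_mismatches = nullptr;
    int h_mismatches = -1;
    if (cudaMalloc(&d_mismatches, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMemset(d_mismatches, 0, sizeof(int)) != cudaSuccess) {
        cudaFree(d_mismatches);
        return -1;
    }
    shared_max_check<<<blocks, dim3(kThreadsX, kChannels)>>>(d_mismatches);
    // cudaMemcpy waits for the kernel and also reports any launch failure.
    if (cudaMemcpy(&h_mismatches, d_mismatches, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        h_mismatches = -1;
    }
    cudaFree(d_mismatches);
    return h_mismatches;
}

int main() {
    printf("mismatches: %d\n", count_mismatches(1024));
    return 0;
}
```

If a reduced kernel like this never reports a mismatch while the full function does, that points toward something outside the barrier itself, such as a __syncthreads() that not every thread of the block reaches, or memory being overwritten elsewhere.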

Any help would be appreciated.

I would suggest making sure that your printf format specifiers are exactly correct for the things you are trying to print. A %d format specifier is not correct for a byte quantity, for example. Alternatively, cast everything that you are printing to int.
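As a rough sketch of the casting suggestion: the kernel below prints values of several narrow or unsigned types, casting each one explicitly so that it matches its format specifier. The byte_t typedef is an assumption (its definition is not shown in the question), and the kernel name, the helper run_print_demo, and the data values are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef unsigned char byte_t;  // assumption: the real definition is not shown

__global__ void print_with_casts(const short* adc, float baseline) {
    const byte_t ch = threadIdx.y;
    const short dfno_sm = 58;  // illustrative value only
    // Cast every integral argument to int so it matches %d exactly, and pass
    // the float as double to match %f.
    printf("%1d/%02d: adc[%04d] = %04d < %.3f sm: %d; blockIdx: %d, %d\n",
           (int)ch, (int)threadIdx.x, 0, (int)adc[0], (double)baseline,
           (int)dfno_sm, (int)blockIdx.x, (int)blockIdx.y);
}

// Runs the demo on one small block; returns 0 on success, -1 on a CUDA error.
int run_print_demo() {
    const short h_adc[1] = {1476};
    short* d_adc = nullptr;
    if (cudaMalloc(&d_adc, sizeof(h_adc)) != cudaSuccess) return -1;
    if (cudaMemcpy(d_adc, h_adc, sizeof(h_adc),
                   cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(d_adc);
        return -1;
    }
    print_with_casts<<<1, dim3(2, 2)>>>(d_adc, 1484.188f);
    // Synchronizing flushes the device-side printf buffer to stdout.
    const cudaError_t err = cudaDeviceSynchronize();
    cudaFree(d_adc);
    return err == cudaSuccess ? 0 : -1;
}

int main() {
    return run_print_demo() == 0 ? 0 : 1;
}
```

With every argument cast, a surprising value in the output can no longer be blamed on a specifier mismatch, which narrows the search to the kernel logic itself.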

Thank you. In the end, I discovered that I had an error elsewhere in my code and it was just manifesting itself here.