How do some threads in a block get past __syncthreads, while others are still before it?

I put extraneous __syncthreads commands in my program for debugging purposes, to try to get all threads’ data to be in the (same) desired state.

However, I find that some threads have executed beyond the __syncthreads command, while others are still stopped at a breakpoint before the __syncthreads command.

Is this an artifact of debugging? Or is my understanding of how __syncthreads is supposed to work flawed?

Is there any way to ensure all threads’ data is in the same state while debugging?

This can happen when there is a call to __syncthreads() in divergent control flow, that is, inside a branch that not every thread of the block takes. That gives rise to undefined behavior. I would suggest running the code under control of cuda-memcheck to see what issues it reports.
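To make the divergent-barrier case concrete, here is a minimal sketch. The kernel name uniformBarrier, the host helper runUniformBarrier, and the block size of 64 are made up for illustration; none of them come from the original poster's code. The pattern to avoid is shown only in a comment, and the kernel itself keeps the barrier outside the branch.

```cuda
#include <cuda_runtime.h>

#define N_THREADS 64   // hypothetical block size, chosen for illustration

// Pattern to avoid (undefined behavior): the barrier sits inside a branch
// that only part of the block enters.
//     if (threadIdx.x < 32) { buf[threadIdx.x] = 2 * threadIdx.x; __syncthreads(); }

// Safe variant: the work may diverge, but every thread reaches the barrier.
__global__ void uniformBarrier(int *out)
{
    __shared__ int buf[N_THREADS];
    if (threadIdx.x < 32) {
        buf[threadIdx.x] = 2 * threadIdx.x;   // divergent work is fine
    } else {
        buf[threadIdx.x] = -1;
    }
    __syncthreads();                          // reached by all threads of the block
    out[threadIdx.x] = buf[(threadIdx.x + 1) % N_THREADS];  // read a neighbour's slot
}

// Host helper: returns the number of mismatching elements, or -1 on a CUDA error.
int runUniformBarrier()
{
    int *dOut = nullptr;
    int hOut[N_THREADS];
    if (cudaMalloc(&dOut, sizeof(hOut)) != cudaSuccess) return -1;
    uniformBarrier<<<1, N_THREADS>>>(dOut);
    cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;
    int bad = 0;
    for (int i = 0; i < N_THREADS; ++i) {
        int j = (i + 1) % N_THREADS;
        int expect = (j < 32) ? 2 * j : -1;
        if (hOut[i] != expect) ++bad;
    }
    return bad;
}
```

Calling runUniformBarrier() from host code should return 0 when every thread saw its neighbour's fully written value.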

I agree with njuffa.

But what exactly does this mean:

“to try to get all threads’ data to be in the (same) desired state”

The notion of adding “extraneous __syncthreads” for debugging purposes is new to me.

I want variables for all threads to be calculated at a breakpoint so I don’t have to continually hit it over and over. If I don’t put in a __syncthreads before the point, when the breakpoint is first hit, some threads’ variables will not be updated to that point.

As far as __syncthreads not working as I expect, I cannot find any __syncthreads which are not hit by all threads. I always debug with cuda-memcheck and no issues are reported.

“I want variables for all threads to be calculated at a breakpoint so I don’t have to continually hit it over and over.”

understood

“I always debug with cuda-memcheck”

I assume this includes racecheck.
Undefined behaviour is a loose term that can be ambiguous; on occasion I am inclined to count skipped barriers (__syncthreads) as undefined behaviour.
Although the guides list a number of cases and conditions that may lead to undefined behaviour (I can think of some in the context of dynamic parallelism, memory and memory pointers, and stream-related APIs), the two common causes of undefined behaviour I have come to “pleasantly enjoy” are:

races / poor synchronization
poor memory allocation

With regard to the former: if a kernel/function shares a common control variable that is set by one or a few threads, and synchronization is incomplete and thus fails, one half of the threads may read the control variable in its former state while the other half reads it in its updated state. This causes significant divergence, and it may seem as if barriers were jumped.
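A small sketch of the shared control variable case, assuming the variable lives in shared memory and is written by thread 0 only. The names sharedControl, runSharedControl and CTRL_THREADS are hypothetical. The barrier between the write and the reads is what keeps some threads from seeing the old (here: uninitialized) value.

```cuda
#include <cuda_runtime.h>

#define CTRL_THREADS 128   // hypothetical block size spanning several warps

// One thread sets a shared control variable; all threads then read it.
__global__ void sharedControl(int newValue, int *seen)
{
    __shared__ int control;
    if (threadIdx.x == 0) {
        control = newValue;          // single writer
    }
    // Without this barrier, threads in other warps could read 'control'
    // before thread 0 has written it, and then branch differently.
    __syncthreads();
    seen[threadIdx.x] = control;     // every thread records what it observed
}

// Host helper: returns how many threads observed a value other than newValue,
// or -1 on a CUDA error.
int runSharedControl(int newValue)
{
    int *dSeen = nullptr;
    int hSeen[CTRL_THREADS];
    if (cudaMalloc(&dSeen, sizeof(hSeen)) != cudaSuccess) return -1;
    sharedControl<<<1, CTRL_THREADS>>>(newValue, dSeen);
    cudaError_t err = cudaMemcpy(hSeen, dSeen, sizeof(hSeen), cudaMemcpyDeviceToHost);
    cudaFree(dSeen);
    if (err != cudaSuccess) return -1;
    int stale = 0;
    for (int i = 0; i < CTRL_THREADS; ++i) {
        if (hSeen[i] != newValue) ++stale;
    }
    return stale;
}
```

With the barrier in place, runSharedControl(7) should return 0. Removing the barrier turns the kernel into a data race, which is the situation described above.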

The only thing I’m doing which might cause divergence is binary reduction in shared memory using

unsigned int sidhalf = w >> 1;
do
{
    __syncthreads();
    sidhalf >>= 1;
} while (sidhalf > 0);

But all threads should compute this conditional in the same way, so I don’t see how they get out of sync.
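For reference, a complete version of this kind of loop might look like the sketch below. It is not the poster's actual kernel: it assumes a 1D shared array of w floats, w a power of two, and one block of exactly w threads, and the names blockSum and runBlockSum are made up. It also uses a while loop instead of the do/while above. The point is that the loop condition depends only on w, so every thread runs the same number of iterations, and the barrier stays outside the thread-dependent if.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Binary (tree) reduction in shared memory.
// Assumptions: w is a power of two and the block has exactly w threads.
__global__ void blockSum(const float *in, float *out, unsigned int w)
{
    extern __shared__ float s[];
    unsigned int tid = threadIdx.x;
    s[tid] = in[tid];

    unsigned int half = w >> 1;
    while (half > 0) {               // depends on w only, so it is uniform across the block
        __syncthreads();             // every thread reaches this in every iteration
        if (tid < half) {            // divergent work, but no barrier inside the branch
            s[tid] += s[tid + half];
        }
        half >>= 1;
    }
    if (tid == 0) out[0] = s[0];     // thread 0 performed the final add itself
}

// Host helper: sums 1..w on the device; returns -1 on a CUDA error.
float runBlockSum(unsigned int w)
{
    std::vector<float> hIn(w);
    for (unsigned int i = 0; i < w; ++i) hIn[i] = float(i + 1);

    float *dIn = nullptr, *dOut = nullptr;
    float result = -1.0f;
    if (cudaMalloc(&dIn, w * sizeof(float)) != cudaSuccess) return result;
    if (cudaMalloc(&dOut, sizeof(float)) != cudaSuccess) { cudaFree(dIn); return result; }

    cudaMemcpy(dIn, hIn.data(), w * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<1, w, w * sizeof(float)>>>(dIn, dOut, w);
    if (cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess) {
        result = -1.0f;
    }
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

If w differed between threads, or if the barrier were moved inside the if (tid < half) branch, threads would no longer reach the same barriers, and that would be the undefined behavior mentioned earlier.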

So “w” in the above example is constant for all threads?

Yes, w is the width of the 2D shared memory array

Is w a constant, passed to the function, or calculated within the function?

If no races are reported, the other method I can think of is to start commenting out sections of the function’s code to find the first point of departure, and thus the code section responsible for it.
In a way, the compiler is ‘human’.
You could easily replace values calculated by those sections with constants, to ensure continuity to some degree.

Yeah, w is a constant passed to the function.

Actually, I am using cuda-memcheck in VS.

Maybe the race checker is only in the external app?

Not sure how to use the external app in a mex function with Matlab.