How do some threads in a block get past __syncthreads, while others are still before it?

robosmith · July 20, 2015, 5:06pm

I put extraneous __syncthreads commands in my program for debugging purposes, to try to get all threads’ data to be in the (same) desired state.

However, I find that some threads have executed beyond the __syncthreads command, while others are still stopped at a breakpoint before the __syncthreads command.

Is this an artifact of debugging? Or is my understanding of how __syncthreads is supposed to work flawed?

Is there any way to ensure all threads’ data is in the same state while debugging?

njuffa · July 20, 2015, 7:44pm

This can happen when there is a call to __syncthreads() in a divergent control flow, which gives rise to undefined behavior. I would suggest running the code under control of cuda-memcheck to see what issues it reports.
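To illustrate the point about divergent control flow, here is a minimal sketch, not taken from the original poster's code. The kernel names, the helper `run_uniform_barrier`, and the 16/32 thread split are all hypothetical. The first kernel puts the barrier inside a branch that only some threads take, which is the undefined case; the second hoists the barrier out of the branch so every thread in the block reaches it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Problematic pattern, shown for illustration only and never launched below:
// only threads 0..15 reach the barrier, so the behavior is undefined.
__global__ void divergent_barrier(int *data)
{
    if (threadIdx.x < 16) {
        data[threadIdx.x] += 1;
        __syncthreads();   // threads 16 and up never arrive here
    }
}

// Safer pattern: the conditional work stays in the branch, while the barrier
// sits outside it so that all threads of the block execute it.
__global__ void uniform_barrier(int *data, int *out)
{
    if (threadIdx.x < 16) {
        data[threadIdx.x] += 1;
    }
    __syncthreads();       // reached by every thread in the block
    out[threadIdx.x] = data[threadIdx.x % 16];
}

// Hypothetical host helper: launches the safe kernel with one block of
// 32 threads and returns the number of output elements that are wrong.
int run_uniform_barrier()
{
    const int n = 32;
    int h_data[16], h_out[n];
    for (int i = 0; i < 16; ++i) h_data[i] = i;

    int *d_data = nullptr, *d_out = nullptr;
    cudaMalloc(&d_data, sizeof(h_data));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

    uniform_barrier<<<1, n>>>(d_data, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_out);

    int mismatches = 0;
    for (int t = 0; t < n; ++t) {
        if (h_out[t] != (t % 16) + 1) ++mismatches;
    }
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", run_uniform_barrier());
    return 0;
}
```

When stepping through a kernel like the second one in a debugger, threads can still appear at different source lines before the barrier; the barrier only guarantees that none of them proceeds past it until all have arrived.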

little_jimmy · July 21, 2015, 4:56am

i agree with njuffa

but what does this mean exactly:

“to try to get all threads’ data to be in the (same) desired state”

the notion of adding “extraneous __syncthreads” for debugging purposes is new for me

robosmith · July 21, 2015, 6:39pm

I want variables for all threads to be calculated at a breakpoint so I don’t have to continually hit it over and over. If I don’t put in a __syncthreads before the point, when the breakpoint is first hit, some threads’ variables will not be updated to that point.

As far as __syncthreads not working as I expect, I cannot find any __syncthreads which are not hit by all threads. I always debug with cuda-memcheck and no issues are reported.

little_jimmy · July 22, 2015, 6:48am

“I want variables for all threads to be calculated at a breakpoint so I don’t have to continually hit it over and over.”

understood

“I always debug with cuda-memcheck”

i assume this includes racecheck
undefined behaviour is a loose term that may be ambiguous; on occasion i am inclined to include skipping barriers (syncthreads) under undefined behaviour
although the guides list a number of cases and conditions that may lead to undefined behavior - i can think of such in the context of dynamic parallelism, memory and memory pointers, and stream-related apis - the 2 common causes of undefined behaviour i have come to pleasantly enjoy are:

races/ poor synchronization
poor memory allocation

with regard to the former - if the threads of a kernel/ function share a common control variable that is set by one or a few threads, and synchronization around it is incomplete, some of the threads may read the control variable in its former state while the rest read it in its updated state; this causes significant divergence, and it may seem as if barriers were jumped
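A minimal sketch of that scenario, assuming a block-wide loop steered by one shared variable. The names `control_variable_demo`, `remaining`, and `run_control_variable_demo` are hypothetical and not from this thread. Thread 0 is the only writer, and the two barriers separate its write from every other thread's read, so all threads see the same value and take the same number of iterations.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void control_variable_demo(int *out, int rounds)
{
    __shared__ int remaining;          // control variable shared by the block

    if (threadIdx.x == 0) remaining = rounds;
    __syncthreads();                   // initial value visible to all threads

    while (remaining > 0) {            // every thread reads the same value
        out[threadIdx.x] += 1;
        __syncthreads();               // all reads of `remaining` are done
        if (threadIdx.x == 0) remaining -= 1;
        __syncthreads();               // update visible before the next read
    }
}

// Hypothetical host helper: returns how many threads did not run
// exactly `rounds` iterations.
int run_control_variable_demo(int rounds)
{
    const int n = 64;
    int h_out[n];
    int *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemset(d_out, 0, sizeof(h_out));

    control_variable_demo<<<1, n>>>(d_out, rounds);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    int mismatches = 0;
    for (int t = 0; t < n; ++t) {
        if (h_out[t] != rounds) ++mismatches;
    }
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", run_control_variable_demo(3));
    return 0;
}
```

Dropping either barrier inside the loop would let thread 0 change `remaining` while other threads are still evaluating the loop condition, which is the kind of race that can make threads leave the loop at different times.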

robosmith · July 22, 2015, 4:56pm

The only thing I’m doing which might cause divergence is binary reduction in shared memory using

    unsigned int sidhalf = w >> 1;
    do
    {
        __syncthreads();
        sidhalf >>= 1;
    } while (sidhalf > 0);

But all threads should compute this conditional in the same way, so I don’t see how they get out of sync.
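For reference, a complete version of this kind of loop might look like the sketch below. It is not the code from this post: the kernel name `block_sum`, the helper `run_block_sum`, the 1D layout, and the assumption that `w` is a power of two equal to the block size are all illustrative. The barrier sits at the top of the loop body, outside the `tid < half` test, and the loop condition depends only on `w`, so every thread executes the same number of barriers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sums w integers per block in shared memory.
// Assumes blockDim.x == w and that w is a power of two.
__global__ void block_sum(const int *in, int *out, unsigned int w)
{
    extern __shared__ int s[];
    unsigned int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * w + tid];

    unsigned int half = w >> 1;
    do {
        __syncthreads();               // executed by all threads, every pass
        if (tid < half) {
            s[tid] += s[tid + half];   // only the lower half does the adds
        }
        half >>= 1;
    } while (half > 0);                // identical condition for all threads

    if (tid == 0) out[blockIdx.x] = s[0];
}

// Hypothetical host helper: sums 1..w on the device and returns the result.
int run_block_sum(unsigned int w)
{
    int *h_in = new int[w];
    for (unsigned int i = 0; i < w; ++i) h_in[i] = static_cast<int>(i) + 1;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, w * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, w * sizeof(int), cudaMemcpyHostToDevice);

    block_sum<<<1, w, w * sizeof(int)>>>(d_in, d_out, w);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    delete[] h_in;
    return result;
}

int main()
{
    printf("sum = %d\n", run_block_sum(64));
    return 0;
}
```

If the barrier were moved inside the `if (tid < half)` branch, or if threads returned early before the loop, the loop would fall into the divergent-barrier case mentioned earlier in the thread.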

Maddy_Scientist · July 22, 2015, 6:24pm

So “w” in the above example is constant for all threads?

robosmith · July 22, 2015, 9:26pm

Yes, w is the width of the 2D shared memory array

little_jimmy · July 23, 2015, 5:11am

is w a constant, passed to the function, or calculated within the function?

if no races are reported, the other method i can think of is to start commenting out sections of the function’s code, to note the first point of departure, and thus the code section responsible for this
in a way, the compiler is ‘human’
you could easily replace values calculated by sections with constants, to ensure continuity to some degree

robosmith · July 23, 2015, 6:10pm

Yeah, w is a constant passed to the function.

Actually, I am using cuda-memcheck in VS.

Maybe race checker is only in the external app?

Not sure how to use the external app in a mex function with Matlab.
