I put extra __syncthreads() calls in my program for debugging purposes, to try to get every thread’s data into the same (desired) state.
However, I find that some threads have executed beyond the __syncthreads() call, while others are still stopped at a breakpoint before it.
Is this an artifact of debugging? Or is my understanding of how __syncthreads is supposed to work flawed?
Is there any way to ensure all threads’ data is in the same state while debugging?
This can happen when a call to __syncthreads() sits in divergent control flow, that is, when not every thread of the block reaches it; this gives rise to undefined behavior. I would suggest running the code under cuda-memcheck to see what issues it reports.
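To make the divergent-barrier case concrete, here is a minimal sketch. The kernels `reverse_broken` and `reverse_fixed` and the host helper `run_reverse_fixed` are hypothetical names made up for illustration; they are not taken from the poster's code, which may look quite different. The sketch assumes a single block of 256 threads.

```cuda
#include <cuda_runtime.h>

// BROKEN (hypothetical example): threads with tid >= n never reach the
// barrier, so __syncthreads() sits in divergent control flow. The behavior
// is undefined: the kernel may hang, or appear to work.
__global__ void reverse_broken(int *data, int n)
{
    __shared__ int tile[256];
    int tid = threadIdx.x;
    if (tid < n) {
        tile[tid] = data[tid];
        __syncthreads();                 // only some threads of the block arrive
        data[tid] = tile[n - 1 - tid];
    }
}

// FIXED: the work stays conditional, the barrier is unconditional,
// so every thread of the block reaches the same __syncthreads().
__global__ void reverse_fixed(int *data, int n)
{
    __shared__ int tile[256];
    int tid = threadIdx.x;
    if (tid < n) tile[tid] = data[tid];
    __syncthreads();
    if (tid < n) data[tid] = tile[n - 1 - tid];
}

// Host helper: runs only the fixed kernel on 0..n-1 and reports whether
// the array came back reversed. Returns false on any CUDA error.
bool run_reverse_fixed(int n)
{
    if (n < 1 || n > 256) return false;
    int host[256];
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev = nullptr;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return false;
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    reverse_fixed<<<1, 256>>>(dev, n);   // one block, more threads than elements
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != n - 1 - i) return false;
    return true;
}
```

The same pattern applies to an early `return` placed before a barrier: threads that have exited never arrive at it.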
I want variables for all threads to be calculated at a breakpoint so I don’t have to continually hit it over and over. If I don’t put in a __syncthreads before the point, when the breakpoint is first hit, some threads’ variables will not be updated to that point.
As for __syncthreads not working as I expect, I cannot find any __syncthreads that is not hit by all threads. I always debug with cuda-memcheck and no issues are reported.
“I want variables for all threads to be calculated at a breakpoint so I don’t have to continually hit it over and over.”
understood
“I always debug with cuda-memcheck”
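i assume this includes racecheck (cuda-memcheck --tool racecheck), since the default memcheck tool does not look for shared memory races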
undefined behaviour is a loose term that may be ambiguous; on occasion i am inclined to include skipping barriers (syncthreads) under undefined behaviour
although the guides list a number of cases and conditions that may lead to undefined behavior (i can think of such cases in the context of dynamic parallelism, memory and memory pointers, and stream-related apis), the 2 common causes of undefined behaviour i have come to pleasantly enjoy are:
with regard to the former - if a kernel/function shares a common control variable, set by one or a few threads, and synchronization is incomplete and thus fails, one half of the threads may read the control variable in its former state, and the other half of the threads may read it in its updated state, causing significant divergence, and it may seem as if barriers were jumped
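a minimal sketch of that shared control variable case, assuming a single block; the kernel read_control and the host helper count_threads_seeing_update are hypothetical names for illustration and are not from the code under discussion:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one thread updates a shared control variable,
// then every thread of the block reads it.
__global__ void read_control(int *seen)
{
    __shared__ int control;              // control variable shared by the block

    if (threadIdx.x == 0) control = 0;   // former state
    __syncthreads();

    if (threadIdx.x == blockDim.x - 1) control = 1;  // updated by one thread

    // Without this barrier the reads below race with the write above:
    // some threads may see 0 and others 1, and then branch differently.
    __syncthreads();

    seen[threadIdx.x] = control;
}

// Host helper (hypothetical): launches one block and returns how many
// threads saw the updated value, or -1 on error.
int count_threads_seeing_update(int threads)
{
    if (threads < 1 || threads > 1024) return -1;

    int *dev = nullptr;
    size_t bytes = threads * sizeof(int);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return -1;

    read_control<<<1, threads>>>(dev);

    int host[1024];
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    int count = 0;
    for (int i = 0; i < threads; ++i) count += (host[i] == 1);
    return count;
}
```

with the second barrier in place every thread should see the updated value; if it is removed, the count may come out anywhere between 0 and the block size, which is the kind of split described above.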
is w a constant, passed to the function, or calculated within the function?
if no races are reported, the other method i can think of is to start commenting out sections of the function’s code, to note the first point of departure, and thus the code section responsible for this
in a way, the compiler is ‘human’
you could easily replace values calculated by sections with constants, to ensure continuity to some degree
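a rough sketch of that bisecting idea, assuming the function can be split into stages; stage1, pipeline, run_pipeline and the STAGE1_AS_CONSTANT switch are all made-up names for illustration, not part of the original code:

```cuda
#include <cuda_runtime.h>

// Set to 1 to replace the computed value with a constant while bisecting,
// and back to 0 to restore the real calculation.
#define STAGE1_AS_CONSTANT 1

// Hypothetical stand-in for one section of the real function.
__device__ float stage1(float x) { return x * x + 1.0f; }

__global__ void pipeline(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;        // safe here only because no __syncthreads() follows

#if STAGE1_AS_CONSTANT
    float a = 2.0f;            // constant chosen to match stage1(1.0f)
#else
    float a = stage1(in[i]);
#endif

    out[i] = 3.0f * a;         // later section, left untouched
}

// Host helper (hypothetical): feeds 1.0f to every element, so both variants
// should produce 6.0f everywhere. Returns false on mismatch or CUDA error.
bool run_pipeline(int n)
{
    if (n < 1 || n > 1024) return false;
    float host[1024];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float *in = nullptr, *out = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&in, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&out, bytes) != cudaSuccess) { cudaFree(in); return false; }
    cudaMemcpy(in, host, bytes, cudaMemcpyHostToDevice);

    pipeline<<<(n + 127) / 128, 128>>>(out, in, n);
    cudaError_t err = cudaMemcpy(host, out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != 6.0f) return false;
    return true;
}
```

flipping the switch one section at a time, and noting when the odd behaviour first appears or disappears, narrows down which section is responsible.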