EDIT: This has been solved. Read on if you want a nice puzzle…

Hello,

I think I may have found a compiler bug, but I just wanted to post this unless anyone else has ideas. So far I have only test on CUDA 8, and I am looking at the following code:

```
{
__shared__ volatile bool success;
__shared__ volatile int atomCount, bpos, nseg;
__shared__ volatile int CoreStart[11], CoreEnd[11], LCap[11], HCap[11];
__shared__ int4 idxBarriers[704];
// Thread indexing
int warpIdx = (threadIdx.x >> GRID_BITS);
int tgx = (threadIdx.x & GRID_BITS_MASK);
// Loop over all hash cells in the core of each sub-image
if (threadIdx.x == 0) {
bpos = blockIdx.x;
}
__syncthreads();
while (bpos < 2 * cSim.npencils) {
// Get the cell to import based on the Neutral Territory import
// table, with provisions for any Y-shifting in the cell grid.
// Determine the number of subdivisions that will be made.
int natom;
int4 limits;
if (threadIdx.x < GRID) {
if (threadIdx.x < 11) {
if (bpos < cSim.npencils) {
limits = cSim.pQQStencilLimits[11*bpos + threadIdx.x];
}
else {
limits = cSim.pLJStencilLimits[11*(bpos - cSim.npencils) + threadIdx.x];
}
natom = limits.w - limits.x;
}
else {
natom = 0;
}
natom += __shfl_down(natom, 8);
natom += __shfl_down(natom, 4);
natom += __shfl_down(natom, 2);
natom += __shfl_down(natom, 1);
if (threadIdx.x == 0) {
atomCount = natom;
}
}
__syncthreads();
// Check to see whether the entire contents of all 11 pencils will fit into
// the __shared__ space (MAX_NT_ATOMS atoms) of one non-bonded block. If so,
// this is a done deal--the information in the NTImports array is valid and
// nothing needs to be done. Otherwise (and this is where the majority of
// the complexity comes in), the goal is to step through the atoms in all 11
// pencils, enumerating each region of up to 255 between two X fractional
// coordinates.
if (atomCount > MAX_NT_ATOMS) {
// Establish boundaries in __shared__
if (threadIdx.x < 11) {
LCap[threadIdx.x] = limits.x;
CoreStart[threadIdx.x] = limits.y;
CoreEnd[threadIdx.x] = limits.z;
HCap[threadIdx.x] = limits.w;
}
else if (threadIdx.x == 32) {
success = false;
nseg = (atomCount + 479) / 480;
}
__syncthreads();
// Lay out Neutral Territory regions for this set of pencils
while (!success) {
// Assume this is going to work until proven otherwise
if (threadIdx.x == 0) {
success = true;
}
PMEFloat invTWidth = (PMEFloat)nseg / ((PMEFloat)1.0 +
((PMEFloat)2.0 * cSim.LowerCapBoundary));
// Initialize the barriers that this warp will set
int pos = tgx;
while (pos <= nseg) {
idxBarriers[11*pos + warpIdx].x = -1;
idxBarriers[11*pos + warpIdx].y = -1;
idxBarriers[11*pos + warpIdx].z = -1;
idxBarriers[11*pos + warpIdx].w = -1;
pos += GRID;
}
// Read X fractional coordinate data from global
int llim = LCap[warpIdx];
int hlim = HCap[warpIdx];
int readIdx = llim + tgx;
while (readIdx < hlim) {
PMEFloat xcrd;
if (bpos < cSim.npencils) {
xcrd = cSim.pHcmbQQFractX[readIdx];
}
else {
xcrd = cSim.pHcmbLJFractX[readIdx];
}
int ixcrd = (xcrd + cSim.LowerCapBoundary) * invTWidth;
int imxcrd = __shfl_up(ixcrd, 1);
if (readIdx == llim) {
idxBarriers[11*ixcrd + warpIdx].x = readIdx;
}
if (tgx > 0 && ixcrd > imxcrd) {
atomicMax(&idxBarriers[11*ixcrd + warpIdx].x, readIdx);
atomicMax(&idxBarriers[11*imxcrd + warpIdx].w, readIdx);
}
if (readIdx == hlim - 1) {
idxBarriers[11*ixcrd + warpIdx].w = hlim;
idxBarriers[11*(ixcrd + 1) + warpIdx].x = hlim;
}
readIdx += GRID - 1;
}
__syncthreads();
// Check to see that no segment is too large
int segIdx = warpIdx;
int segcount = invTWidth + 0.999999;
while (segIdx < segcount) {
int segpop;
if (tgx < 11) {
segpop = idxBarriers[11*segIdx + tgx].w - idxBarriers[11*segIdx + tgx].x;
}
else {
segpop = 0;
}
segpop += __shfl_down(segpop, 8);
segpop += __shfl_down(segpop, 4);
segpop += __shfl_down(segpop, 2);
segpop += __shfl_down(segpop, 1);
if (tgx == 0 && segpop > 510) {
success = false;
}
segIdx += 11;
}
__syncthreads();
// If this failed, increase the number of segments and try again.
// Otherwise commit the results.
if (success) {
// Determine where core atoms start and stop in each region.
// Use only the first 11 threads as nseg is likely to be small
// (less than 11 in nearly all cases) and this makes the
// memory access pattern better.
if (tgx < 11) {
int pos = tgx + 11*warpIdx;
int cs = CoreStart[tgx];
int ce = CoreEnd[tgx];
while (pos < 11 * nseg) {
if (cs < idxBarriers[pos].w && ce >= idxBarriers[pos].x) {
idxBarriers[pos].y = max(cs, idxBarriers[pos].x);
idxBarriers[pos].z = min(ce, idxBarriers[pos].w);
}
pos += 121;
}
}
// Now write the import instructions. Select one unit from the
// list, then a second (outer loop i = 1; i < nseg; i++, inner
// loop j = 0; j < i; j++) and make sure that the union of i
// and j has has a non-trivial content of core
// atoms. Then, loop over the list again, find a second region,
// make sure that at least some of its atoms are within range of
// the first region, and do not let it BE the first region. In
// order to get the core atoms of each region to interact with
// the
}
else {
if (threadIdx.x == 0) {
nseg++;
}
}
__syncthreads();
}
}
else {
// Write out the import instructions, working off the fact that
// the entire atom contents of all 11 hash cells will, in fact,
// fit into the one import region.
}
// Increment to the next assignment
if (threadIdx.x == 0) {
bpos = atomicAdd(&cSim.pFrcBlkCounters[3], 1);
}
__syncthreads();
}
}
```

Notice the if (success) near at the end. The overall structure is:

```
while (bpos < (# of pencil hash cells)) {
if (this is the first WARP) {
SET VALUE OF atomCount
}
__syncthreads()
if (atomCount < MAXIMUM_ALLOWED_VALUE) {
ESTABLISH BOUNDARIES FOR DATA READING IN __SHARED__
SET SUCCESS TO false
__syncthreads()
while (!success) {
SET SUCCESS TO true
INITIALIZE LIMITS THAT THIS WARP WILL DEAL WITH
READ DATA AND ESTABLISH LIMITS
__syncthreads()
CHECK BOUNDARIES AND TOTAL PARTICLE COUNT IN EACH BIN
if (success) {
UPDATE LIMITS IN HERE
< This is where I cannot put __syncthreads() >
< I would like to write out results to global in here but I would need __syncthreads() >
}
else {
INCREASE THE NUMBER OF BINS FOR MAKING FINER LIMITS
< This is another place I cannot put __syncthreads() >
< And, I did not foolishly test __syncthreads() inside the if (threadIdx.x == 0) nested branch >
}
__syncthreads()
}
}
else {
write out results to global
}
}
```

Now, what I’d like to do is have some __syncthreads() statements in that innermost if (success) branch, but it seems that if I do this the kernel will eventually lock. It may run all right the first time, but if I execute the kernel 2.5 milllion times I’ll eventually see the calculation hang. Similarly if I try putting a __syncthreads() in the else part of that same conditional. It’s quite reproducible to lock the kernel with __syncthreads() in either of the places I indicated, and then have it run fine, even 2.5 million times in a row, if I simply remove the offending __syncthreads() call. What’s got my very confused is that I’ve set success to be volatile, there is nothing inside of that sensitive if (success) statement to change the value of success, and there are other __syncthreads() around the whole thing to make sure that every thread, upon seeing that conditional branching, will read the same value of success from **shared**. All of the block’s threads should be marching through that conditional statement the same way.

I get similar results if I set success to be an integer and use values of 0 or 1, evaluating with ==.

Can anyone offer some observations here? I’ve got __syncthreads() all over this kernel, I just can’t seem to put it in two places where I would like to have it and would seem to be perfectly safe. I’m continuing to test, but as long as I don’t put __syncthreads() in either branch of that if (success) { … } / else { … }, I don’t ever seem to get a lock, and if I do it’ll inevitably lock. The kernel can be written such that I don’t need __syncthreads() in there: in fact, it’s a little cleaner if I just get that work done and then rely on the __syncthreads() after the conditional to line everything up again prior to final stage when I write results back to global. But, it sure would be nice to understand what’s going on.

Cheers!