How is this kernel locking with __syncthreads()?

EDIT: This has been solved. Read on if you want a nice puzzle…

Hello,

I think I may have found a compiler bug, but I wanted to post this in case anyone else has ideas. So far I have only tested on CUDA 8, and I am looking at the following code:

{
  __shared__ volatile bool success;
  __shared__ volatile int atomCount, bpos, nseg;
  __shared__ volatile int CoreStart[11], CoreEnd[11], LCap[11], HCap[11];
  __shared__ int4 idxBarriers[704];
  
  // Thread indexing
  int warpIdx = (threadIdx.x >> GRID_BITS);
  int tgx     = (threadIdx.x & GRID_BITS_MASK);

  // Loop over all hash cells in the core of each sub-image
  if (threadIdx.x == 0) {
    bpos = blockIdx.x;
  }
  __syncthreads();
  while (bpos < 2 * cSim.npencils) {
    
    // Get the cell to import based on the Neutral Territory import
    // table, with provisions for any Y-shifting in the cell grid.
    // Determine the number of subdivisions that will be made.
    int natom;
    int4 limits;
    if (threadIdx.x < GRID) {
      if (threadIdx.x < 11) {
        if (bpos < cSim.npencils) {
          limits = cSim.pQQStencilLimits[11*bpos + threadIdx.x];
        }
        else {
          limits = cSim.pLJStencilLimits[11*(bpos - cSim.npencils) + threadIdx.x];
        }
        natom = limits.w - limits.x;
      }
      else {
        natom = 0;
      }
      natom += __shfl_down(natom, 8);
      natom += __shfl_down(natom, 4);
      natom += __shfl_down(natom, 2);
      natom += __shfl_down(natom, 1);
      if (threadIdx.x == 0) {
        atomCount = natom;
      }
    }
    __syncthreads();
    
    // Check to see whether the entire contents of all 11 pencils will fit into
    // the __shared__ space (MAX_NT_ATOMS atoms) of one non-bonded block.  If so,
    // this is a done deal--the information in the NTImports array is valid and
    // nothing needs to be done.  Otherwise (and this is where the majority of
    // the complexity comes in), the goal is to step through the atoms in all 11
    // pencils, enumerating each region of up to 255 between two X fractional
    // coordinates.
    if (atomCount > MAX_NT_ATOMS) {

      // Establish boundaries in __shared__
      if (threadIdx.x < 11) {
        LCap[threadIdx.x]      = limits.x;
        CoreStart[threadIdx.x] = limits.y;
        CoreEnd[threadIdx.x]   = limits.z;
        HCap[threadIdx.x]      = limits.w;
      }
      else if (threadIdx.x == 32) {
        success = false;
        nseg = (atomCount + 479) / 480;
      }
      __syncthreads();
      
      // Lay out Neutral Territory regions for this set of pencils
      while (!success) {

        // Assume this is going to work until proven otherwise
        if (threadIdx.x == 0) {
          success = true;
        }
        PMEFloat invTWidth = (PMEFloat)nseg / ((PMEFloat)1.0 +
                                               ((PMEFloat)2.0 * cSim.LowerCapBoundary));

        // Initialize the barriers that this warp will set
        int pos = tgx;
        while (pos <= nseg) {
          idxBarriers[11*pos + warpIdx].x = -1;
          idxBarriers[11*pos + warpIdx].y = -1;
          idxBarriers[11*pos + warpIdx].z = -1;
          idxBarriers[11*pos + warpIdx].w = -1;
          pos += GRID;
        }
        
        // Read X fractional coordinate data from global
        int llim = LCap[warpIdx];
        int hlim = HCap[warpIdx];
        int readIdx = llim + tgx;
        while (readIdx < hlim) {
          PMEFloat xcrd;
          if (bpos < cSim.npencils) {
            xcrd = cSim.pHcmbQQFractX[readIdx];
          }
          else {
            xcrd = cSim.pHcmbLJFractX[readIdx];
          }
          int ixcrd = (xcrd + cSim.LowerCapBoundary) * invTWidth;
          int imxcrd = __shfl_up(ixcrd, 1);
          if (readIdx == llim) {
            idxBarriers[11*ixcrd + warpIdx].x = readIdx;
          }
          if (tgx > 0 && ixcrd > imxcrd) {
            atomicMax(&idxBarriers[11*ixcrd  + warpIdx].x, readIdx);
            atomicMax(&idxBarriers[11*imxcrd + warpIdx].w, readIdx);
          }
          if (readIdx == hlim - 1) {
            idxBarriers[11*ixcrd       + warpIdx].w = hlim;
            idxBarriers[11*(ixcrd + 1) + warpIdx].x = hlim;
          }
          readIdx += GRID - 1;
        }
        __syncthreads();
        
        // Check to see that no segment is too large
        int segIdx = warpIdx;
        int segcount = invTWidth + 0.999999;
        while (segIdx < segcount) {
          int segpop;
          if (tgx < 11) {
            segpop = idxBarriers[11*segIdx + tgx].w - idxBarriers[11*segIdx + tgx].x;
          }
          else {
            segpop = 0;
          }
          segpop += __shfl_down(segpop, 8);
          segpop += __shfl_down(segpop, 4);
          segpop += __shfl_down(segpop, 2);
          segpop += __shfl_down(segpop, 1);
          if (tgx == 0 && segpop > 510) {
            success = false;
          }
          segIdx += 11;
        }
        __syncthreads();
        
        // If this failed, increase the number of segments and try again.
        // Otherwise commit the results.
        if (success) {

          // Determine where core atoms start and stop in each region.
          // Use only the first 11 threads as nseg is likely to be small
          // (less than 11 in nearly all cases) and this makes the
          // memory access pattern better.
          if (tgx < 11) {
            int pos = tgx + 11*warpIdx;
            int cs = CoreStart[tgx];
            int ce = CoreEnd[tgx];
            while (pos < 11 * nseg) {
              if (cs < idxBarriers[pos].w && ce >= idxBarriers[pos].x) {
                idxBarriers[pos].y = max(cs, idxBarriers[pos].x);
                idxBarriers[pos].z = min(ce, idxBarriers[pos].w);
              }
              pos += 121;
            }
          }

          // Now write the import instructions.  Select one unit from the
          // list, then a second (outer loop i = 1; i < nseg; i++, inner
          // loop j = 0; j < i; j++) and make sure that the union of i
          // and j has a non-trivial content of core
          // atoms.  Then, loop over the list again, find a second region,
          // make sure that at least some of its atoms are within range of
          // the first region, and do not let it BE the first region.  In
          // order to get the core atoms of each region to interact with
          // the 
        }
        else {
          if (threadIdx.x == 0) {
            nseg++;
          }
        }
        __syncthreads();
      }
    }
    else {

      // Write out the import instructions, working off the fact that
      // the entire atom contents of all 11 hash cells will, in fact,
      // fit into the one import region.
    }

    // Increment to the next assignment
    if (threadIdx.x == 0) {
      bpos = atomicAdd(&cSim.pFrcBlkCounters[3], 1);
    }
    __syncthreads();
  }
}

Notice the if (success) near the end. The overall structure is:

while (bpos < (# of pencil hash cells)) {
  if (this is the first WARP) {
    SET VALUE OF atomCount
  }
  __syncthreads()
  if (atomCount > MAXIMUM_ALLOWED_VALUE) {
    ESTABLISH BOUNDARIES FOR DATA READING IN __SHARED__
    SET SUCCESS TO false
    __syncthreads()
    while (!success) {
      SET SUCCESS TO true
      INITIALIZE LIMITS THAT THIS WARP WILL DEAL WITH
      READ DATA AND ESTABLISH LIMITS
      __syncthreads()
      CHECK BOUNDARIES AND TOTAL PARTICLE COUNT IN EACH BIN
      if (success) {
        UPDATE LIMITS IN HERE

        < This is where I cannot put __syncthreads() >

        < I would like to write out results to global in here but I would need __syncthreads() >
      }
      else {
        INCREASE THE NUMBER OF BINS FOR MAKING FINER LIMITS

        < This is another place I cannot put __syncthreads() >
      < And no, I was not foolish enough to put __syncthreads() inside the nested if (threadIdx.x == 0) branch >
      }
      __syncthreads()
    }
  }
  else {
    write out results to global
  }
}

Now, what I’d like to do is have some __syncthreads() statements in that innermost if (success) branch, but it seems that if I do, the kernel will eventually lock. It may run all right the first time, but if I execute the kernel 2.5 million times I’ll eventually see the calculation hang. The same happens if I try putting a __syncthreads() in the else part of that same conditional. It’s quite reproducible: the kernel locks with __syncthreads() in either of the places I indicated, and then runs fine, even 2.5 million times in a row, if I simply remove the offending __syncthreads() call. What’s got me very confused is that I’ve declared success to be volatile, there is nothing inside that sensitive if (success) statement to change the value of success, and there are other __syncthreads() calls around the whole thing to make sure that every thread, upon reaching that conditional branch, will read the same value of success from __shared__. All of the block’s threads should be marching through that conditional statement the same way.

I get similar results if I set success to be an integer and use values of 0 or 1, evaluating with ==.

Can anyone offer some observations here? I’ve got __syncthreads() all over this kernel; I just can’t seem to put it in the two places where I would like to have it and where it would seem to be perfectly safe. I’m continuing to test, but as long as I don’t put __syncthreads() in either branch of that if (success) { … } / else { … }, I never seem to get a lock, and if I do, it’ll inevitably lock. The kernel can be written such that I don’t need __syncthreads() in there: in fact, it’s a little cleaner if I just get that work done and then rely on the __syncthreads() after the conditional to line everything up again prior to the final stage, when I write results back to global. But it sure would be nice to understand what’s going on.
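
In case it helps anyone reproduce this, here is a distilled toy kernel with the same shape as the code above: a volatile __shared__ flag, a retry loop, and a barrier inside the loop body. The names and the trivial retry condition are invented for illustration; only the control flow mirrors the real kernel. Launched as one block of a few warps, it locks up in the same intermittent way:

__global__ void retryLoopToy(int *out)
{
  __shared__ volatile bool success;
  __shared__ volatile int nseg;

  if (threadIdx.x == 0) {
    success = false;
    nseg = 1;
  }
  __syncthreads();

  while (!success) {

    // Assume this attempt will work until proven otherwise
    if (threadIdx.x == 0) {
      success = true;
    }

    // Stand-in for the real per-iteration work: some thread may find a
    // problem with the attempt and flag a retry
    if (threadIdx.x == 0 && nseg < 4) {
      success = false;
    }
    __syncthreads();

    // On failure, refine the subdivision and go around again
    if (!success && threadIdx.x == 0) {
      nseg++;
    }
    __syncthreads();
  }

  if (threadIdx.x == 0) {
    out[blockIdx.x] = nseg;
  }
}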

Cheers!

Curiouser and curiouser… I’m now looking at the following code, and it will lock if I place a __syncthreads() at the stated place (see near the very bottom). This is right above another __syncthreads() that appears to be perfectly benign!

{
  __shared__ volatile bool success;
  __shared__ volatile int atomCount, bpos, nseg;
  __shared__ volatile int CoreStart[11], CoreEnd[11], LCap[11], HCap[11];
  __shared__ int4 idxBarriers[704];
  
  // Thread indexing
  int warpIdx = (threadIdx.x >> GRID_BITS);
  int tgx     = (threadIdx.x & GRID_BITS_MASK);

  // Loop over all hash cells in the core of each sub-image
  if (threadIdx.x == 0) {
    bpos = blockIdx.x;
  }
  __syncthreads();
  while (bpos < 2 * cSim.npencils) {

    // Get the cell to import based on the Neutral Territory import
    // table, with provisions for any Y-shifting in the cell grid.
    // Determine the number of subdivisions that will be made.
    int natom;
    int4 limits;
    if (threadIdx.x < GRID) {
      if (threadIdx.x < 11) {
        if (bpos < cSim.npencils) {
          limits = cSim.pQQStencilLimits[11*bpos + threadIdx.x];
        }
        else {
          limits = cSim.pLJStencilLimits[11*(bpos - cSim.npencils) + threadIdx.x];
        }
        natom = limits.w - limits.x;
      }
      else {
        natom = 0;
      }
      natom += __shfl_down(natom, 8);
      natom += __shfl_down(natom, 4);
      natom += __shfl_down(natom, 2);
      natom += __shfl_down(natom, 1);
      if (threadIdx.x == 0) {
        atomCount = natom;
      }
    }
    __syncthreads();
    
    // Check to see whether the entire contents of all 11 pencils will fit into
    // the __shared__ space (MAX_NT_ATOMS atoms) of one non-bonded block.  If so,
    // this is a done deal--the information in the NTImports array is valid and
    // nothing needs to be done.  Otherwise (and this is where the majority of
    // the complexity comes in), the goal is to step through the atoms in all 11
    // pencils, enumerating each region of up to 255 between two X fractional
    // coordinates.
    if (atomCount > MAX_NT_ATOMS) {

      // Establish boundaries in __shared__
      if (threadIdx.x < 11) {
        LCap[threadIdx.x]      = limits.x;
        CoreStart[threadIdx.x] = limits.y;
        CoreEnd[threadIdx.x]   = limits.z;
        HCap[threadIdx.x]      = limits.w;
      }
      else if (threadIdx.x == 32) {
        success = false;
        nseg = (atomCount + 479) / 480;
      }
      __syncthreads();
      
      // Lay out Neutral Territory regions for this set of pencils
      while (!success) {

        // Assume this is going to work until proven otherwise
        if (threadIdx.x == 0) {
          success = true;
        }
        PMEFloat invTWidth = (PMEFloat)nseg / ((PMEFloat)1.0 +
                                               ((PMEFloat)2.0 * cSim.LowerCapBoundary));

        // Initialize the barriers that this warp will set
        int pos = tgx;
        while (pos <= nseg) {
          idxBarriers[11*pos + warpIdx].x = -1;
          idxBarriers[11*pos + warpIdx].y = -1;
          idxBarriers[11*pos + warpIdx].z = -1;
          idxBarriers[11*pos + warpIdx].w = -1;
          pos += GRID;
        }
        
        // Read X fractional coordinate data from global
        int llim = LCap[warpIdx];
        int hlim = HCap[warpIdx];
        int readIdx = llim + tgx;
        while (readIdx < hlim) {
          PMEFloat xcrd;
          if (bpos < cSim.npencils) {
            xcrd = cSim.pHcmbQQFractX[readIdx];
          }
          else {
            xcrd = cSim.pHcmbLJFractX[readIdx];
          }
          int ixcrd = (xcrd + cSim.LowerCapBoundary) * invTWidth;
          int imxcrd = __shfl_up(ixcrd, 1);
          if (readIdx == llim) {
            idxBarriers[11*ixcrd + warpIdx].x = readIdx;
          }
          if (tgx > 0 && ixcrd > imxcrd) {
            atomicMax(&idxBarriers[11*ixcrd  + warpIdx].x, readIdx);
            atomicMax(&idxBarriers[11*imxcrd + warpIdx].w, readIdx);
          }
          if (readIdx == hlim - 1) {
            idxBarriers[11*ixcrd       + warpIdx].w = hlim;
            idxBarriers[11*(ixcrd + 1) + warpIdx].x = hlim;
          }
          readIdx += GRID - 1;
        }
        __syncthreads();
        
        // Check to see that no segment is too large
        int segIdx = warpIdx;
        int segcount = invTWidth + 0.999999;
        while (segIdx < segcount) {
          int segpop;
          if (tgx < 11) {
            segpop = idxBarriers[11*segIdx + tgx].w - idxBarriers[11*segIdx + tgx].x;
          }
          else {
            segpop = 0;
          }
          segpop += __shfl_down(segpop, 8);
          segpop += __shfl_down(segpop, 4);
          segpop += __shfl_down(segpop, 2);
          segpop += __shfl_down(segpop, 1);
          if (tgx == 0 && segpop > 510) {
            success = false;
          }
          segIdx += 11;
        }
        __syncthreads();
        
        // If this failed, increase the number of segments and try again.
        // Otherwise commit the results.
        if (success) {

          // Determine where core atoms start and stop in each region.
          // Use only the first 11 threads as nseg is likely to be small
          // (less than 11 in nearly all cases) and this makes the
          // memory access pattern better.
          if (tgx < 11) {
            int pos = tgx + 11*warpIdx;
            int cs = CoreStart[tgx];
            int ce = CoreEnd[tgx];
            while (pos < 11 * nseg) {
              if (cs < idxBarriers[pos].w && ce >= idxBarriers[pos].x) {
                idxBarriers[pos].y = max(cs, idxBarriers[pos].x);
                idxBarriers[pos].z = min(ce, idxBarriers[pos].w);
              }
              pos += 121;
            }
          }
        }
        else {
          if (threadIdx.x == 0) {
            nseg++;
          }
        }
        __syncthreads();
      }
    }
    else {

      // If the above wasn't necessary (all of the hash cells' atoms will fit
      // conveniently into the import region), then the information stored in
      // limits can be loaded directly into the idxBarriers array.
      if (threadIdx.x < 11) {
        idxBarriers[threadIdx.x].x = limits.x;
        idxBarriers[threadIdx.x].y = limits.y;
        idxBarriers[threadIdx.x].z = limits.z;
        idxBarriers[threadIdx.x].w = limits.w;
      }
      if (threadIdx.x == 0) {
        nseg = 1;
      }
    }

    // CANNOT PUT A __SYNCTHREADS HERE !!!
    __syncthreads();
    // COMMENT OUT THAT LAST __SYNCTHREADS AND ALL IS FINE, LEAVE IT IN PLACE AND KERNEL LOCKS

    // Write out the import instructions, working off the fact that
    // the entire atom contents of all 11 hash cells will, in fact,
    // fit into the one import region.

    // Increment to the next assignment
    if (threadIdx.x == 0) {
      bpos = atomicAdd(&cSim.pFrcBlkCounters[3], 1);
    }
    __syncthreads();
  }
}

This is remarkable… I should add that if I comment out everything between lines 17 and 183 (that is, wrap those lines in #if 0 / #endif), things again run fine, as of course they should. But I am beginning to wonder whether there is some limit on the number of __syncthreads() calls I am allowed to make in one kernel. That’s probably not the answer, but something very odd is at work.

OK, it seems I’ve found the problem. Up at lines 69-75, what can happen is that threads (particularly in the first warp) can enter the while loop with success == false, set success = true, and then hit the __syncthreads() below. If, however, some threads hadn’t yet made it INSIDE the while loop before success got set to true, they read the flag as true, skip the loop body entirely, and never arrive at that __syncthreads(). The threads already inside wait at the barrier forever, and the kernel locks up.
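
One way to repair the loop (a sketch of the principle under that diagnosis, not necessarily the exact change made to the production code): open every iteration with a barrier, so that no thread can write the flag until every thread has committed to the iteration, and keep all writes to the flag ahead of the iteration’s final barrier.

  while (!success) {

    // Every thread has now read success == false and entered the loop;
    // only after this barrier may any thread write to the flag.
    __syncthreads();
    if (threadIdx.x == 0) {
      success = true;
    }
    __syncthreads();

    // ... the attempt; any thread may set success = false in here ...

    __syncthreads();
    // No thread writes success between this barrier and the loop condition,
    // so every thread reads the same settled value when it goes back around.
  }

With the top barrier in place, the value of success is always settled by the time any thread evaluates while (!success), so the whole block enters or skips the iteration together.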

Problem solved!