Are __device__ functions with __syncthreads() a bad idea?

Hello,

I have the following __device__ __forceinline__ function, which seems to work fine when I unit test it, and on the FIRST pass through a loop in one of the kernels that calls it. On subsequent passes, however, I start to see very strange behavior. The function performs ye olde in-place prefix sum on data in __shared__ memory.

/// \brief Compute the prefix sum over an array of values at the block level.  It is expected that
///        all memory used by this routine be exclusive to one block.  Whatever type is used to
///        represent the data must be able to hold the sum of all data.  The thread block must be
///        sized as a multiple of the warp width.  This will compute an exclusive prefix sum and
///        leave the result in place of the original data, with the total of all values returned.
///
/// \param v  The array of data elements
/// \param s1  The first array of scratch values, allocated with at least one value for every
///            warp width or partial warp width worth of values in v
/// \param s2  The second array of scratch values, allocated with at least one value for every
///            warp width or partial warp width worth of values in s1, and at most as many values
///            as the warp has lanes.
/// \param n   The total number of values in v
template <typename T> __device__ __forceinline__
T blockExclusivePrefixSum(T* v, T* s1, T* s2, const int n) {
  T result = (T)(0);
  if (n < warp_size_int && threadIdx.x < warp_size_int) {
    T var = (threadIdx.x < n) ? v[threadIdx.x] : (T)(0);
    EXCLUSIVE_WARP_PREFIXSUM_SAVETOTAL(var, threadIdx.x, result);
    v[threadIdx.x] = var;
  }
  else {
    const int lane_idx = (threadIdx.x & warp_bits_mask_int);
    const int nbatch = ((n + warp_bits_mask_int) >> warp_bits);
    for (int warp_pos = (threadIdx.x >> warp_bits); warp_pos < nbatch;
         warp_pos += (blockDim.x >> warp_bits)) {
      const int idx_test = (warp_pos << warp_bits) + lane_idx; 
      T var = (idx_test < n) ? v[idx_test] : (T)(0);
      T warp_total;
      EXCLUSIVE_WARP_PREFIXSUM_SAVETOTAL(var, lane_idx, warp_total);
      if (idx_test < n) {
        v[idx_test] = var;
      }
      if (lane_idx == 0) {
        s1[warp_pos] = warp_total;
      }
    }
    __syncthreads();
    if (n > warp_size_int * warp_size_int) {
      const int nbundle = ((nbatch + warp_bits_mask_int) >> warp_bits);
      for (int warp_pos = (threadIdx.x >> warp_bits); warp_pos < nbundle; 
           warp_pos += (blockDim.x >> warp_bits)) {
        const int idx_test = (warp_pos << warp_bits) + lane_idx;
        T var = (idx_test < nbatch) ? s1[idx_test] : (T)(0);
        T warp_total;
        EXCLUSIVE_WARP_PREFIXSUM_SAVETOTAL(var, lane_idx, warp_total);
        if (idx_test < nbatch) {
          s1[idx_test] = var;
        }
        s2[warp_pos] = warp_total;
      }
      __syncthreads();
      if (threadIdx.x < warp_size_int) {
        T var = (threadIdx.x < nbundle) ? s2[threadIdx.x] : (T)(0);
        EXCLUSIVE_WARP_PREFIXSUM_SAVETOTAL(var, lane_idx, result);
        if (threadIdx.x < nbundle) {
          s2[threadIdx.x] = var;
        }
      }
      __syncthreads();
      for (int warp_pos = (threadIdx.x >> warp_bits); warp_pos < nbundle;
           warp_pos += (blockDim.x >> warp_bits)) {
        const int idx_test = (warp_pos << warp_bits) + lane_idx;
        if (idx_test < nbatch) {
          s1[idx_test] += s2[warp_pos];
        }
      }
      __syncthreads();
    }
    else {
      if (threadIdx.x < warp_size_int) {
        T var = (threadIdx.x < nbatch) ? s1[threadIdx.x] : (T)(0);
        EXCLUSIVE_WARP_PREFIXSUM_SAVETOTAL(var, lane_idx, result);
        if (threadIdx.x < nbatch) {
          s1[threadIdx.x] = var;
        }
      }
      __syncthreads();
    }
    for (int warp_pos = (threadIdx.x >> warp_bits); warp_pos < nbatch; 
         warp_pos += (blockDim.x >> warp_bits)) {
      const T boost = s1[warp_pos];
      v[(warp_pos << warp_bits) + lane_idx] += boost;
    }
  }
  __syncthreads();
  return result;
}
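For reference, the exclusive prefix sum the routine is documented to compute (each element replaced by the sum of everything before it, grand total returned) can be checked against a serial sketch like the following. This is plain host-side C++ with a hypothetical helper name, not part of the kernel:

```cpp
#include <vector>
#include <cassert>

// Serial reference for an exclusive prefix sum: element i is replaced by the
// sum of elements 0..i-1, and the total of all original values is returned.
template <typename T>
T serialExclusivePrefixSum(std::vector<T>& v) {
  T running = T(0);
  for (auto& x : v) {
    const T next = running + x;
    x = running;     // exclusive: the element's own value is not included
    running = next;
  }
  return running;    // total of all original values
}
```

For example, { 3, 1, 4 } becomes { 0, 3, 4 } with a returned total of 8; a unit test for the block-level version can compare against this.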

The behavior I am seeing can be described as follows. Suppose I implement a calling kernel like this:

__global__ void mykernel() {
  __shared__ int v[1024], s1[32], s2[32];
  __shared__ int cell_na;

  // Decide on the value of cell_na, which will become the length of the prefix sum
  // Main loop
  int pos = threadIdx.x;  
  while (pos < 1000) {

    // Initialize the prefix sum
    __syncthreads();
    for (int i = threadIdx.x; i < 1024; i += blockDim.x) {
      v[i] = 0;

      // CHECK (Part A)
      printf("Initialize v[%4d] = 0\n", i);
      // END CHECK
    }
    __syncthreads();

    // Fill the bins, e.g. v[0] := 15, v[1] := 14.  This is done using atomicAdd() on the
    // array in __shared__.
    (...)
    __syncthreads();

    // CHECK (Part B)
    if (threadIdx.x == 0) {
      for (int i = 0; i < cell_na; i++) {
        printf("v[%4d] = %3d\n", i, v[i]);
      }
    }
    __syncthreads();
    // END CHECK

    // Compute the prefix sum, capturing the final value in the last element of the
    // array. (It is guaranteed that cell_na <= 1023).
    blockExclusivePrefixSum<int>(v, s1, s2, cell_na + 1);

    // CHECK (Part C)
    if (threadIdx.x == 0) {
      for (int i = 0; i < cell_na; i++) {
        printf("PrefixSum v[%4d] = %5d\n", i, v[i]);
      }
    }
    __syncthreads();
    // END CHECK
    
    // Other computations
    __syncthreads();
    pos += blockDim.x;
  }
}

What seems to happen is that, at first, I see the expected behavior: the print statements from the check in part A fire off, then those in part B, and then those in part C. On subsequent trips through the loop, however, all of that printed information starts to come out in a bizarre order. Print statements in part B will report that the array is full of zeros, and those in part C will report that it is full of values but not yet a prefix sum. I was getting bad results from the kernel, too. Abstracting out only the part I needed (the special case where n < 1024) and manually inlining it into the kernel, I got correct results. compute-sanitizer did not report errors either way, but the kernel now runs smoothly and produces the output I expect. (More testing is obviously warranted.)

I’m trying to understand how the results of those print statements could run so counter to the order in which things were clearly happening, especially with all of the __syncthreads() operations, which should have kept things in order. I wonder whether having __syncthreads() in the inlined function created the odd behavior, perhaps indicating a compiler bug?
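One thing worth noting: __syncthreads() inside a __device__ function is legal in itself, but the CUDA programming guide requires that any __syncthreads() call be reached by all threads of the block (i.e., the surrounding condition must evaluate identically block-wide); otherwise the behavior is undefined and can look exactly like this kind of out-of-order corruption. A minimal, hypothetical sketch of the hazard (not code from the kernel above):

```
// Hypothetical illustration, not taken from the kernel above.
__device__ void helperHazard(int n) {
  if (threadIdx.x < n) {
    // ... work ...
    __syncthreads();  // UNDEFINED if any thread of the block skips this branch
  }
}

// Safer pattern: keep the work divergent if need be, but hoist the barrier
// out of the divergent region so every thread reaches the same call.
__device__ void helperSafe(int n) {
  if (threadIdx.x < n) {
    // ... work ...
  }
  __syncthreads();    // every thread in the block reaches this barrier
}
```

Note that in blockExclusivePrefixSum(), the branch at the top tests threadIdx.x as well as n, so when n < warp_size_int the two halves of the block can take different paths around the internal barriers.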

Update: it seems my __device__ function was prone to its own memory errors. I think this issue can be closed for now. While it’s certainly worth documenting when a __device__ function contains an internal __syncthreads() call, there is no evidence at the moment that such a call interacts badly with any compiler bug.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.