Are __device__ functions with __syncthreads() a bad idea?

Hello,

I have the following __device__ __forceinline__ function, which seems to work fine when I unit test it, and on the FIRST pass through a loop in one of the kernels that calls it. On subsequent passes, however, I start to see very strange behavior. The function performs ye olde in-place prefix sum on data in __shared__ memory.

/// \brief Compute the prefix sum over an array of values at the block level.  It is expected that
///        all memory used by this routine be exclusive to one block.  Whatever type is used to
///        represent the data must be able to hold the sum of all data.  The thread block must be
///        sized as a multiple of the warp width.  This will compute an exclusive prefix sum and
///        leave the result in place of the original data, with the total of all values returned.
///
/// \param v  The array of data elements
/// \param s1  The first array of scratch values, allocated with at least one value for every
///            warp width or partial warp width worth of values in v
/// \param s2  The second array of scratch values, allocated with at least one value for every
///            warp width or partial warp width worth of values in s1, and at most as many values
///            as the warp has lanes.
/// \param n   The total number of values in v
template <typename T> __device__ __forceinline__
T blockExclusivePrefixSum(T* v, T* s1, T* s2, const int n) {
  T result = (T)(0);
  if (n < warp_size_int && threadIdx.x < warp_size_int) {
    T var = (threadIdx.x < n) ? v[threadIdx.x] : (T)(0);
    EXCLUSIVE_WARP_PREFIXSUM_SAVETOTAL(var, threadIdx.x, result);
    v[threadIdx.x] = var;
  }
  else {
    const int lane_idx = (threadIdx.x & warp_bits_mask_int);
    const int nbatch = ((n + warp_bits_mask_int) >> warp_bits);
    for (int warp_pos = (threadIdx.x >> warp_bits); warp_pos < nbatch;
         warp_pos += (blockDim.x >> warp_bits)) {
      const int idx_test = (warp_pos << warp_bits) + lane_idx; 
      T var = (idx_test < n) ? v[idx_test] : (T)(0);
      T warp_total;
      EXCLUSIVE_WARP_PREFIXSUM_SAVETOTAL(var, lane_idx, warp_total);
      if (idx_test < n) {
        v[idx_test] = var;
      }
      if (lane_idx == 0) {
        s1[warp_pos] = warp_total;
      }
    }
    __syncthreads();
    if (n > warp_size_int * warp_size_int) {
      const int nbundle = ((nbatch + warp_bits_mask_int) >> warp_bits);
      for (int warp_pos = (threadIdx.x >> warp_bits); warp_pos < nbundle; 
           warp_pos += (blockDim.x >> warp_bits)) {
        const int idx_test = (warp_pos << warp_bits) + lane_idx;
        T var = (idx_test < nbatch) ? s1[idx_test] : (T)(0);
        T warp_total;
        EXCLUSIVE_WARP_PREFIXSUM_SAVETOTAL(var, lane_idx, warp_total);
        if (idx_test < nbatch) {
          s1[idx_test] = var;
        }
        s2[warp_pos] = warp_total;
      }
      __syncthreads();
      if (threadIdx.x < warp_size_int) {
        T var = (threadIdx.x < nbundle) ? s2[threadIdx.x] : (T)(0);
        EXCLUSIVE_WARP_PREFIXSUM_SAVETOTAL(var, lane_idx, result);
        if (threadIdx.x < nbundle) {
          s2[threadIdx.x] = var;
        }
      }
      __syncthreads();
      for (int warp_pos = (threadIdx.x >> warp_bits); warp_pos < nbundle;
           warp_pos += (blockDim.x >> warp_bits)) {
        const int idx_test = (warp_pos << warp_bits) + lane_idx;
        if (idx_test < nbatch) {
          s1[idx_test] += s2[warp_pos];
        }
      }
      __syncthreads();
    }
    else {
      if (threadIdx.x < warp_size_int) {
        T var = (threadIdx.x < nbatch) ? s1[threadIdx.x] : (T)(0);
        EXCLUSIVE_WARP_PREFIXSUM_SAVETOTAL(var, lane_idx, result);
        if (threadIdx.x < nbatch) {
          s1[threadIdx.x] = var;
        }
      }
      __syncthreads();
    }
    for (int warp_pos = (threadIdx.x >> warp_bits); warp_pos < nbatch; 
         warp_pos += (blockDim.x >> warp_bits)) {
      const T boost = s1[warp_pos];
      v[(warp_pos << warp_bits) + lane_idx] += boost;
    }
  }
  __syncthreads();
  return result;
}
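For reference, the exclusive prefix sum the routine is documented to compute (each element replaced by the sum of everything before it, grand total returned) can be checked against a serial sketch like the following. This is plain host-side C++ with a hypothetical helper name, not part of the kernel:

```cpp
#include <vector>
#include <cassert>

// Serial reference for an exclusive prefix sum: element i is replaced by the
// sum of elements 0..i-1, and the total of all original values is returned.
template <typename T>
T serialExclusivePrefixSum(std::vector<T>& v) {
  T running = T(0);
  for (auto& x : v) {
    const T next = running + x;
    x = running;     // exclusive: the element's own value is not included
    running = next;
  }
  return running;    // total of all original values
}
```

For example, { 3, 1, 4 } becomes { 0, 3, 4 } with a returned total of 8; a unit test for the block-level version can compare against this.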

The behavior I am seeing can be described as follows. Suppose I implement a calling kernel like this:

__global__ void mykernel() {
  __shared__ int v[1024], s1[32], s2[32];
  __shared__ int cell_na;

  // Decide on the value of cell_na, which will become the length of the prefix sum
  // Main loop
  int pos = threadIdx.x;  
  while (pos < 1000) {

    // Initialize the prefix sum
    __syncthreads();
    for (int i = threadIdx.x; i < 1024; i += blockDim.x) {
      v[i] = 0;

      // CHECK (Part A)
      printf("Initialize v[%4d] = 0\n", i);
      // END CHECK
    }
    __syncthreads();

    // Fill the bins, e.g. v[0] := 15, v[1] := 14.  This is done using atomicAdd() on the
    // array in __shared__.
    (...)
    __syncthreads();

    // CHECK (Part B)
    if (threadIdx.x == 0) {
      for (int i = 0; i < cell_na; i++) {
        printf("v[%4d] = %3d\n", i, v[i]);
      }
    }
    __syncthreads();
    // END CHECK

    // Compute the prefix sum, capturing the final value in the last element of the
    // array. (It is guaranteed that cell_na <= 1023).
    blockExclusivePrefixSum<int>(v, s1, s2, cell_na + 1);

    // CHECK (Part C)
    if (threadIdx.x == 0) {
      for (int i = 0; i < cell_na; i++) {
        printf("PrefixSum v[%4d] = %5d\n", i, v[i]);
      }
    }
    __syncthreads();
    // END CHECK
    
    // Other computations
    __syncthreads();
    pos += blockDim.x;
  }
}

What seems to happen is that, at first, I see the expected behavior: the print statements from the check in part A fire off, then those in part B, and then those in part C. On subsequent trips through the loop, however, all of that printed information starts to come out in a bizarre order. Print statements in part B will report that the array is full of zeros, and those in part C will report that it is full of values but not yet a prefix sum. I was getting bad results from the kernel, too. Abstracting out only the part I needed (the special case where n < 1024) and manually inlining it into the kernel, I got correct results. compute-sanitizer did not report errors either way, but the kernel now runs smoothly and produces the output I expect. (More testing is obviously warranted.)

I’m trying to understand how the results of those print statements could run so counter to the order in which things were clearly happening, especially with all of the __syncthreads() operations, which should have kept things in order. I wonder whether having __syncthreads() in the inlined function created the odd behavior, perhaps indicating a compiler bug?
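One thing worth noting: __syncthreads() inside a __device__ function is legal in itself, but the CUDA programming guide requires that any __syncthreads() call be reached by all threads of the block (i.e., the surrounding condition must evaluate identically block-wide); otherwise the behavior is undefined and can look exactly like this kind of out-of-order corruption. A minimal, hypothetical sketch of the hazard (not code from the kernel above):

```
// Hypothetical illustration, not taken from the kernel above.
__device__ void helperHazard(int n) {
  if (threadIdx.x < n) {
    // ... work ...
    __syncthreads();  // UNDEFINED if any thread of the block skips this branch
  }
}

// Safer pattern: keep the work divergent if need be, but hoist the barrier
// out of the divergent region so every thread reaches the same call.
__device__ void helperSafe(int n) {
  if (threadIdx.x < n) {
    // ... work ...
  }
  __syncthreads();    // every thread in the block reaches this barrier
}
```

Note that in blockExclusivePrefixSum(), the branch at the top tests threadIdx.x as well as n, so when n < warp_size_int the two halves of the block can take different paths around the internal barriers.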

Update: it seems my __device__ function was prone to its own memory errors. I think this issue can be closed for now. While it’s certainly worth documenting when a __device__ function contains an internal __syncthreads() call, there is no evidence at the moment that such a call interacts badly with any compiler bug.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.