Compute sanitizer racecheck bug with cg::thread_block_tile for tile size > 32

The compute-sanitizer racecheck tool seems to have a bug with tile sizes > 32: it reports false-positive races even though there is a tile.sync() between the conflicting accesses.

A minimal working example:

// Kernel to perform reduction using tiles
template <typename Op>
__global__ void reduceKernel(float *input, float *output, int numElements, Op operation)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;

  __shared__ float S;
  cg::thread_block cta = cg::this_thread_block();
  cg::thread_block_tile<TileSize> tile = cg::tiled_partition<TileSize>(cta);

  float value = (tid < numElements) ? input[tid] : 0.0f;

  // Perform tile-based reduction
  const uint group_rank = tid / TileSize;

  if (group_rank == 0)
  {
    value = tileReduce(tile, value, operation); // a reduction helper that uses tile.shfl_down
    if (tile.thread_rank() == 0)
    {
      *output = value;
      S = value;
    }
    tile.sync();
  }
  if (group_rank == 0 && tile.thread_rank() == 0)
  {
    printf("S is %f\n", S);
  }
}

If I run compute-sanitizer --tool racecheck on this kernel with TileSize > 32, it reports a race between the write and read of S. This is clearly a false positive, since there is a tile.sync() between the two accesses. The error goes away when TileSize <= 32.

The code produces the correct result, so I think this is not an error in the thread_block_tile class but a bug in the compute-sanitizer racecheck tool.

Can you provide a complete reproduction example? Also, the tile size should be no more than 32 (the warp size in CUDA), and the compiler should refuse to compile code that uses shfl_down otherwise. Example below with the tile size set to 64:

test.cu(20): error: class "cooperative_groups::__v1::thread_block_tile<64U, void>" has no member "shfl_down"
          value = operation(value, tile.shfl_down(value, offset));
                                        ^

1 error detected in the compilation of "test.cu".

For future reference, here is a link to the relevant section of the CUDA programming guide: CUDA C++ Programming Guide.
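As an aside: on sm_80 and newer, where tiles larger than a warp are supported, cg::reduce from <cooperative_groups/reduce.h> can replace the manual shfl_down loop entirely. A minimal sketch (kernel name and data layout are illustrative, not from the code above; assumes a launch with at least 64 threads per block):

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

// Max-reduction over a 64-wide tile without any explicit shuffle loop.
__global__ void reduceWithCgReduce(const float *input, float *output)
{
  cg::thread_block cta = cg::this_thread_block();
  cg::thread_block_tile<64> tile = cg::tiled_partition<64>(cta);

  float value = input[tile.thread_rank()];
  // cg::greater<float> turns cg::reduce into a max-reduction;
  // every thread in the tile receives the reduced result.
  value = cg::reduce(tile, value, cg::greater<float>());

  if (tile.thread_rank() == 0)
    *output = value;
}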

I get it. Sorry for not sending the full code; I didn't want to complicate it. When TileSize is greater than 32, a macro-guarded branch defines a subtile and performs the reduction using that.

I have refactored the code to remove any external dependencies, so it won't exactly match the original post. Running compute-sanitizer --tool racecheck on this code throws an error with TileSize > 32 and not otherwise. However, I don't think there is actually any race, since there is a tile.sync() between the conflicting accesses.

Here is the full code:

#include <stdio.h>
#include <cuda_runtime.h>
#include <cub/cub.cuh> // Include CUB
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

#define BlockSize 128
#define TileSize 64
#define SubTileSize 32

#if TileSize > 32
template <typename Op>
__device__ float warpReduce(cg::thread_block_tile<SubTileSize> tile, float value, Op operation)
{
  // Intra-tile reduction
  for (int offset = tile.size() / 2; offset > 0; offset /= 2)
  {
    value = operation(value, tile.shfl_down(value, offset));
  }
  return value;
}

// Generalized tile-based reduction function
template <typename Op>
__device__ void tileReduce(cg::thread_block_tile<TileSize> tile, float *orig_val, Op operation, float &red_val)
{
  float value = 0.0f; // identity value; fine here since the test reduces with max over values >= 0
  cg::thread_block_tile<SubTileSize> subtile = cg::tiled_partition<SubTileSize>(tile);
  if (subtile.meta_group_rank() == 0)
  {
    subtile.sync();
    for (uint i = subtile.thread_rank(); i < TileSize; i += SubTileSize)
      value = operation(value, orig_val[i]); // apply the passed-in functor rather than hardcoding max
    subtile.sync();
    value = warpReduce(subtile, value, operation);
  }
  tile.sync();

  if (tile.thread_rank() == 0)
    red_val = value;
}

#else
template <typename Op>
__device__ void tileReduce(cg::thread_block_tile<TileSize> tile, float *orig_val, Op operation, float &red_val)
{
  // Intra-tile reduction
  float value = orig_val[tile.thread_rank()];
  for (int offset = tile.size() / 2; offset > 0; offset /= 2)
  {
    value = operation(value, tile.shfl_down(value, offset));
  }
  tile.sync();
  if (tile.thread_rank() == 0)
    red_val = value;
}
#endif

// Kernel to perform reduction using tiles
template <typename Op>
__global__ void reduceKernel(float *input, float *output, int numElements, Op operation)
{
  __shared__ float orig_val[TileSize];
  __shared__ float red_val;

  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  const uint group_rank = tid / TileSize;
  cg::thread_block cta = cg::this_thread_block();
  cg::thread_block_tile<TileSize> tile = cg::tiled_partition<TileSize>(cta);

  // Perform tile-based reduction

  if (group_rank == 0)
  {
    orig_val[tile.thread_rank()] = 0.0f; // each tile thread owns one accumulator slot; start at the identity
    for (uint i = tile.thread_rank(); i < numElements; i += TileSize)
      orig_val[tile.thread_rank()] = operation(orig_val[tile.thread_rank()], input[i]);

    tile.sync();
    tileReduce(tile, orig_val, operation, red_val); // a reduction helper that uses tile.shfl_down
    tile.sync();

    if (tile.thread_rank() == 0)
    {
      *output = red_val;
    }
    tile.sync();
  }
  if (group_rank == 0 && tile.thread_rank() == 0)
  {
    printf("S is %f\n", red_val);
  }
}

// Host-side code
int main()
{
  const int numElements = 1024;

  // Allocate and initialize host data
  float h_input[numElements];
  for (int i = 0; i < numElements; ++i)
  {
    h_input[i] = 1.0f; // simple input where each element is 1.0
  }
  h_input[2] = 2.0f; // change one element to 2.0
  h_input[16] = 5.0f;
  h_input[60] = 6.3f;

  // Allocate device memory
  float *d_input = nullptr;
  float *d_output = nullptr;
  cudaMalloc((void **)&d_input, numElements * sizeof(float));
  cudaMalloc((void **)&d_output, sizeof(float));

  // Copy input data to device
  cudaMemcpy(d_input, h_input, numElements * sizeof(float), cudaMemcpyHostToDevice);

  // Initialize output to zero
  float h_output = 0.0f;
  cudaMemcpy(d_output, &h_output, sizeof(float), cudaMemcpyHostToDevice);

  // Launch the kernel (using the cub::Max functor)
  // execKernel((reduceKernel), 1, BlockSize, 0, true,
  //            d_input, d_output, numElements, cub::Max());
  reduceKernel<<<1, BlockSize>>>(d_input, d_output, numElements, cub::Max());

  // Copy the result back to host
  cudaMemcpy(&h_output, d_output, sizeof(float), cudaMemcpyDeviceToHost);

  // Print the result
  printf("Reduction result: %f\n", h_output); // Expect 1024.0 for sum reduction

  // Clean up
  cudaFree(d_input);
  cudaFree(d_output);

  return 0;
}


Thanks. Unfortunately, I don't see any race with racecheck locally. Can you confirm whether you are running the latest compute-sanitizer version from the 12.6 toolkit, and if not, try it? It's possible this is a bug we have already fixed. Also, can you tell me which GPU architecture you are using? Thanks!

I am using compute-sanitizer from 12.6 toolkit:

compute-sanitizer --version
NVIDIA (R) Compute Sanitizer
Copyright (c) 2020-2024 NVIDIA Corporation
Version 2024.3.0.0 (build 34714021) (public-release)

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0

I am on sm_80 (A100 80 GB cards).

This is my compilation recipe:

nvcc  -g -Xcompiler -fopenmp -lineinfo -O3 -arch=sm_80 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c tilereduce.cu -o build/obj/./tilereduce.cu.o
mkdir -p build/exe/./
nvcc build/obj/./tilereduce.cu.o -o build/exe/./tilereduce.cu.exe -lcuda -lgomp
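And the racecheck invocation against the resulting binary:

compute-sanitizer --tool racecheck build/exe/./tilereduce.cu.exe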

Thank you for these details! I can confirm I can reproduce this locally. I’ll file a bug internally to track this issue.

Great, thanks. What was your environment when you tried it locally and didn't see any errors?
I have access to some other machines and may try them until this gets resolved.

I could not reproduce this with architecture sm_75, but I can with sm_80 and newer.

Edit: the code does not actually work on sm_75, since creating tiles larger than a warp there requires the user to manually provide shared memory, as specified here: CUDA C++ Programming Guide. This is why it did not report errors for me on sm_75.
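For reference, the reserved-shared-memory pattern for wide tiles on pre-sm_80 parts looks roughly like this. I am paraphrasing the guide, so treat the exact cg::block_tile_memory signature as an assumption and check the linked section:

__global__ void kernelWithReservedTileMemory()
{
  // Scratch space cooperative groups uses to emulate wide tiles on older
  // architectures; the template parameter is the maximum block size.
  __shared__ cg::block_tile_memory<BlockSize> scratch;
  cg::thread_block cta = cg::this_thread_block(scratch);
  cg::thread_block_tile<TileSize> tile = cg::tiled_partition<TileSize>(cta);
  // ... reduction as before ...
}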


I see, thanks for the note.
Is it okay to declare __shared__ memory inside a device function that may not be entered by all threads in the block, but is definitely entered by all threads in a tile?

Also, even though it didn’t work on sm_75, I guess we can agree that this is a false positive error by racecheck.

Is it okay to declare __shared__ memory inside a device function that may not be entered by all threads in the block, but is definitely entered by all threads in a tile?

It should be okay, yes, but technically that memory is accessible by all threads in the block, regardless of whether they enter the function.
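A sketch of what I mean, with a hypothetical helper that is not from your code: the variable below gets a single block-wide slot, so two different tiles calling this helper concurrently would race on the same storage:

__device__ float tileBroadcast(cg::thread_block_tile<32> tile, float v)
{
  // Function-scope declaration, but block-scope storage: every thread in
  // the block that calls this helper sees this same single float.
  __shared__ float slot;
  if (tile.thread_rank() == 0)
    slot = v;
  tile.sync();
  return slot; // safe only while a single tile per block uses the helper
}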

Also, even though it didn’t work on sm_75, I guess we can agree that this is a false positive error by racecheck.

Correct
