Compute sanitizer racecheck bug with cg::thread_block_tile for tile size > 32

The compute-sanitizer racecheck tool seems to have a bug with tile sizes > 32: it reports false-positive races even though there is a tile.sync() between the conflicting accesses.

A minimal working example:

// Kernel to perform reduction using tiles
template <typename Op>
__global__ void reduceKernel(float *input, float *output, int numElements, Op operation)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;

  __shared__ float S;
  cg::thread_block cta = cg::this_thread_block();
  cg::thread_block_tile<TileSize> tile = cg::tiled_partition<TileSize>(cta);

  float value = (tid < numElements) ? input[tid] : 0.0f;

  // Perform tile-based reduction
  const uint group_rank = tid / TileSize;

  if (group_rank == 0)
  {
    value = tileReduce(tile, value, operation); // a reduction helper that uses tile.shfl_down
    if (tile.thread_rank() == 0)
    {
      *output = value;
      S = value;
    }
    tile.sync();
  }
  if (group_rank == 0 && tile.thread_rank() == 0)
  {
    printf("S is %f\n", S);
  }
}

If I run compute-sanitizer --tool racecheck on this kernel with TileSize > 32, it reports a race between the write and read of S. This is clearly a false positive, since there is a tile.sync() between the two accesses. The error goes away when TileSize <= 32.

The code produces the correct result, so I think this is not an error in the thread_block_tile class but a bug in the compute-sanitizer racecheck tool.

Can you provide a complete reproduction example? Also, the tile size should be no more than 32 (the warp size in CUDA), and the compiler should refuse to compile code that uses shfl_down otherwise. Example below with the tile size set to 64:

test.cu(20): error: class "cooperative_groups::__v1::thread_block_tile<64U, void>" has no member "shfl_down"
          value = operation(value, tile.shfl_down(value, offset));
                                        ^

1 error detected in the compilation of "test.cu".

For future reference, here is a link to the relevant section of the CUDA programming guide: CUDA C++ Programming Guide.
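As an aside: on sm_80 and newer, where tiles larger than a warp are supported, cg::reduce from <cooperative_groups/reduce.h> can replace the manual shfl_down loop entirely. A minimal sketch (kernel name and data layout are illustrative, not from the code above; assumes a launch with at least 64 threads per block):

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

// Max-reduction over a 64-wide tile without any explicit shuffle loop.
__global__ void reduceWithCgReduce(const float *input, float *output)
{
  cg::thread_block cta = cg::this_thread_block();
  cg::thread_block_tile<64> tile = cg::tiled_partition<64>(cta);

  float value = input[tile.thread_rank()];
  // cg::greater<float> turns cg::reduce into a max-reduction;
  // every thread in the tile receives the reduced result.
  value = cg::reduce(tile, value, cg::greater<float>());

  if (tile.thread_rank() == 0)
    *output = value;
}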

I get it. Sorry for not sending the full code; I didn't want to complicate it. When TileSize is greater than 32, a macro-guarded branch defines a subtile and performs the reduction using that.

I have refactored the code to remove any external dependencies, so it won't exactly match the original post. Running compute-sanitizer --tool racecheck on this code throws an error with TileSize > 32 and not otherwise. However, I don't think there is actually any race, since there is a tile.sync() between the conflicting accesses.

Here is the full code:

#include <stdio.h>
#include <cuda_runtime.h>
#include <cub/cub.cuh> // Include CUB
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

#define BlockSize 128
#define TileSize 64
#define SubTileSize 32

#if TileSize > 32
template <typename Op>
__device__ float warpReduce(cg::thread_block_tile<SubTileSize> tile, float value, Op operation)
{
  // Intra-tile reduction
  for (int offset = tile.size() / 2; offset > 0; offset /= 2)
  {
    value = operation(value, tile.shfl_down(value, offset));
  }
  return value;
}

// Generalized tile-based reduction function
template <typename Op>
__device__ void tileReduce(cg::thread_block_tile<TileSize> tile, float *orig_val, Op operation, float &red_val)
{
  float value = 0.0f; // identity value; fine here since the test reduces with max over values >= 0
  cg::thread_block_tile<SubTileSize> subtile = cg::tiled_partition<SubTileSize>(tile);
  if (subtile.meta_group_rank() == 0)
  {
    subtile.sync();
    for (uint i = subtile.thread_rank(); i < TileSize; i += SubTileSize)
      value = operation(value, orig_val[i]); // apply the passed-in functor rather than hardcoding max
    subtile.sync();
    value = warpReduce(subtile, value, operation);
  }
  tile.sync();

  if (tile.thread_rank() == 0)
    red_val = value;
}

#else
template <typename Op>
__device__ void tileReduce(cg::thread_block_tile<TileSize> tile, float *orig_val, Op operation, float &red_val)
{
  // Intra-tile reduction
  float value = orig_val[tile.thread_rank()];
  for (int offset = tile.size() / 2; offset > 0; offset /= 2)
  {
    value = operation(value, tile.shfl_down(value, offset));
  }
  tile.sync();
  if (tile.thread_rank() == 0)
    red_val = value;
}
#endif

// Kernel to perform reduction using tiles
template <typename Op>
__global__ void reduceKernel(float *input, float *output, int numElements, Op operation)
{
  __shared__ float orig_val[TileSize];
  __shared__ float red_val;

  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  const uint group_rank = tid / TileSize;
  cg::thread_block cta = cg::this_thread_block();
  cg::thread_block_tile<TileSize> tile = cg::tiled_partition<TileSize>(cta);

  // Perform tile-based reduction

  if (group_rank == 0)
  {
    orig_val[tile.thread_rank()] = 0.0f; // each tile thread owns one accumulator slot; start at the identity
    for (uint i = tile.thread_rank(); i < numElements; i += TileSize)
      orig_val[tile.thread_rank()] = operation(orig_val[tile.thread_rank()], input[i]);

    tile.sync();
    tileReduce(tile, orig_val, operation, red_val); // a reduction helper that uses tile.shfl_down
    tile.sync();

    if (tile.thread_rank() == 0)
    {
      *output = red_val;
    }
    tile.sync();
  }
  if (group_rank == 0 && tile.thread_rank() == 0)
  {
    printf("S is %f\n", red_val);
  }
}

// Host-side code
int main()
{
  const int numElements = 1024;

  // Allocate and initialize host data
  float h_input[numElements];
  for (int i = 0; i < numElements; ++i)
  {
    h_input[i] = 1.0f; // simple input where each element is 1.0
  }
  h_input[2] = 2.0f; // change one element to 2.0
  h_input[16] = 5.0f;
  h_input[60] = 6.3f;

  // Allocate device memory
  float *d_input = nullptr;
  float *d_output = nullptr;
  cudaMalloc((void **)&d_input, numElements * sizeof(float));
  cudaMalloc((void **)&d_output, sizeof(float));

  // Copy input data to device
  cudaMemcpy(d_input, h_input, numElements * sizeof(float), cudaMemcpyHostToDevice);

  // Initialize output to zero
  float h_output = 0.0f;
  cudaMemcpy(d_output, &h_output, sizeof(float), cudaMemcpyHostToDevice);

  // Launch the kernel (using the cub::Max functor)
  // execKernel((reduceKernel), 1, BlockSize, 0, true,
  //            d_input, d_output, numElements, cub::Max());
  reduceKernel<<<1, BlockSize>>>(d_input, d_output, numElements, cub::Max());

  // Copy the result back to host
  cudaMemcpy(&h_output, d_output, sizeof(float), cudaMemcpyDeviceToHost);

  // Print the result
  printf("Reduction result: %f\n", h_output); // Expect 1024.0 for sum reduction

  // Clean up
  cudaFree(d_input);
  cudaFree(d_output);

  return 0;
}


Thanks. Unfortunately, I don't see any race with racecheck locally. Can you confirm whether you are running the latest compute-sanitizer version from the 12.6 toolkit, and if not, try it? It's possible this is a bug we have already fixed. Also, can you tell me which GPU architecture you are using? Thanks!

I am using compute-sanitizer from 12.6 toolkit:

compute-sanitizer --version
NVIDIA (R) Compute Sanitizer
Copyright (c) 2020-2024 NVIDIA Corporation
Version 2024.3.0.0 (build 34714021) (public-release)

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0

I am on sm_80 (A100 80 GB cards).

This is my compilation recipe:

nvcc  -g -Xcompiler -fopenmp -lineinfo -O3 -arch=sm_80 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c tilereduce.cu -o build/obj/./tilereduce.cu.o
mkdir -p build/exe/./
nvcc build/obj/./tilereduce.cu.o -o build/exe/./tilereduce.cu.exe -lcuda -lgomp
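And the racecheck invocation against the resulting binary:

compute-sanitizer --tool racecheck build/exe/./tilereduce.cu.exe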

Thank you for these details! I can confirm I can reproduce this locally. I’ll file a bug internally to track this issue.

Great, thanks. What was your environment when you tried it locally and didn't see any errors?
I have access to some other machines and may try them until this gets resolved.

I could not reproduce this with architecture sm_75, but I can with sm_80 and newer.

Edit: the code does not actually work on sm_75, since creating tiles larger than a warp there requires the user to manually provide shared memory, as specified here: CUDA C++ Programming Guide. This is why it did not report errors for me on sm_75.
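For reference, the reserved-shared-memory pattern for wide tiles on pre-sm_80 parts looks roughly like this. I am paraphrasing the guide, so treat the exact cg::block_tile_memory signature as an assumption and check the linked section:

__global__ void kernelWithReservedTileMemory()
{
  // Scratch space cooperative groups uses to emulate wide tiles on older
  // architectures; the template parameter is the maximum block size.
  __shared__ cg::block_tile_memory<BlockSize> scratch;
  cg::thread_block cta = cg::this_thread_block(scratch);
  cg::thread_block_tile<TileSize> tile = cg::tiled_partition<TileSize>(cta);
  // ... reduction as before ...
}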


I see, thanks for the note.
Is it okay to declare __shared__ memory inside a device function that may not be entered by all threads in the block, but is definitely entered by all threads in a tile?

Also, even though it didn’t work on sm_75, I guess we can agree that this is a false positive error by racecheck.

Is it okay to declare __shared__ memory inside a device function that may not be entered by all threads in the block, but is definitely entered by all threads in a tile?

It should be okay, yes, but technically that memory is accessible by all threads in the block, regardless of whether they enter the function.
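A sketch of what I mean, with a hypothetical helper that is not from your code: the variable below gets a single block-wide slot, so two different tiles calling this helper concurrently would race on the same storage:

__device__ float tileBroadcast(cg::thread_block_tile<32> tile, float v)
{
  // Function-scope declaration, but block-scope storage: every thread in
  // the block that calls this helper sees this same single float.
  __shared__ float slot;
  if (tile.thread_rank() == 0)
    slot = v;
  tile.sync();
  return slot; // safe only while a single tile per block uses the helper
}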

Also, even though it didn’t work on sm_75, I guess we can agree that this is a false positive error by racecheck.

Correct
