CUDA racecheck, shared memory array and cudaDeviceSynchronize()

I posted this problem on Stack Overflow but never got an answer (see http://stackoverflow.com/questions/13861017/cuda-racecheck-shared-memory-array-and-cudadevicesynchronize). I may be luckier here.

This was tested on Linux with:

GPU: GeForce GT 650M
Driver Version: 313.09
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

Consider the following simple program:

template <unsigned int NTHREADS>
__global__ void kernel_test()
{
    const int SIZE_X = 4;
    const int SIZE_Y = 4;

    __shared__ float tmp[SIZE_X * SIZE_Y * NTHREADS];

    for (unsigned int i = 0; i < SIZE_X; i++)
        for (unsigned int j = 0; j < SIZE_Y; j++)
            tmp[i*blockDim.x*SIZE_Y + j*blockDim.x + threadIdx.x] = threadIdx.x;
}

int main()
{
  const unsigned int NTHREADS = 32;

  //kernel_test<NTHREADS><<<32, NTHREADS>>>(); // ---> works fine
  kernel_test<NTHREADS><<<64, NTHREADS>>>();

  cudaDeviceSynchronize(); // ---> gives racecheck errors if NBLOCKS > 32
}

This can be compiled with nvcc test.cu --ptxas-options=-v -o test

Running the program under cuda-memcheck with the racecheck tool (cuda-memcheck --tool racecheck test), I get errors depending on the number of blocks, and they seem to be triggered by cudaDeviceSynchronize().

The errors detected by the tool look like this:

========= ERROR: Potential WAW hazard detected at shared 0x6 in block (57, 0, 0) :
========= Write Thread (0, 0, 0) at 0x00000048 in …h:403:void kernel_test(void)
========= Write Thread (1, 0, 0) at 0x00000048 in …h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 128

========= INFO:(Identical data being written) Potential WAW hazard detected at shared 0x0 in block (47, 0, 0) :
========= Write Thread (32, 0, 0) at 0x00000048 in …h:403:void kernel_test(void)
========= Write Thread (0, 0, 0) at 0x00000048 in …h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 0

Let’s consider two different cases:

  • kernel_test<NTHREADS><<<32, NTHREADS>>>(); : 32 blocks, 32 threads => does not lead to any apparent racecheck error.
  • kernel_test<NTHREADS><<<64, NTHREADS>>>(); : 64 blocks, 32 threads => leads to WAW hazards (threadIdx.x = 32?!) and errors.

So what am I doing wrong here? Am I doing something wrong with shared memory?

For starters, cudaDeviceSynchronize() isn’t the cause; your kernel is. The kernel launch is asynchronous, however, so the error is only reported when you call cudaDeviceSynchronize().
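
A minimal sketch of that behaviour (using a hypothetical dummy_kernel rather than the kernel from the question): launch configuration errors show up immediately via cudaGetLastError(), while errors that happen during kernel execution only surface at the next synchronizing call.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel() { }

int main()
{
    dummy_kernel<<<64, 32>>>();

    // Problems with the launch configuration itself are visible right away.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    // Errors occurring while the kernel runs only surface at the next
    // synchronizing call, such as cudaDeviceSynchronize().
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("execution error: %s\n", cudaGetErrorString(err));

    return 0;
}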

As for the kernel, your shared memory is of size SIZE_X * SIZE_Y * NTHREADS (which in the example translates to 512 elements per block). In your nested loops you index into it with i*blockDim.x*SIZE_Y + j*blockDim.x + threadIdx.x, and this is where your problem is.

To be more specific, your i and j values will range over [0, 4), your threadIdx.x over [0, 32), and your SIZE_X and SIZE_Y values are 4.
When blockDim.x is 64, the maximum index used in the loop will be 991 (from 3*64*4 + 3*64 + 31). When blockDim.x is 32, the maximum index will be 511 (from 3*32*4 + 3*32 + 31).

Based on your code, you should get errors whenever NBLOCKS exceeds NTHREADS.
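
For reference, a small host-side sketch that reproduces the index arithmetic above (max_index is a hypothetical helper; SIZE_X and SIZE_Y mirror the kernel, and the thread index is capped at NTHREADS - 1 = 31 as described):

#include <cstdio>

// Largest linear index produced by the nested loops, i.e. with
// i = SIZE_X - 1, j = SIZE_Y - 1 and the highest thread index.
unsigned int max_index(unsigned int block_dim_x, unsigned int nthreads)
{
    const unsigned int SIZE_X = 4;
    const unsigned int SIZE_Y = 4;
    return (SIZE_X - 1) * block_dim_x * SIZE_Y
         + (SIZE_Y - 1) * block_dim_x
         + (nthreads - 1);
}

int main()
{
    printf("blockDim.x = 32 -> max index %u\n", max_index(32, 32)); // 511
    printf("blockDim.x = 64 -> max index %u\n", max_index(64, 32)); // 991
    return 0;
}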

I concur with alrikai’s analysis. @alrikai: If you are on Stack Overflow, it would be great if you could post your answer there. I would be happy to vote it up.

@njuffa That sounds good, I just did so. Thanks

Thanks, you got 10 more points now :-)

Oh I see, nice catch! And the “memcheck” tool of cuda-memcheck cannot catch shared memory errors, only “racecheck” is able to do that?

Also, regarding the code: I have been tweaking the original code to test things out.

If one considers this, which is actually closer to what I had (I do not know why I started linearizing everything, since I even added an error…):

__shared__ float tmp[SIZE_X][SIZE_Y][NTHREADS];

for (unsigned int i = 0; i < SIZE_X; i++)
    for (unsigned int j = 0; j < SIZE_Y; j++)
        tmp[i][j][threadIdx.x] = threadIdx.x;

I get the same error. Yet, shared memory is of size SIZE_X * SIZE_Y * NTHREADS, and I believe that I stay within the memory range. So what is going on here?
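
For completeness, here is a self-contained sketch of that variant (kernel_test_3d is just a name for this reconstruction; the thread count and launch parameters are the same as in the original program), which can be passed to cuda-memcheck --tool racecheck as-is:

template <unsigned int NTHREADS>
__global__ void kernel_test_3d()
{
    const int SIZE_X = 4;
    const int SIZE_Y = 4;

    // 4 * 4 * 32 floats = 2 KB of shared memory per block;
    // each thread only ever writes tmp[i][j][threadIdx.x].
    __shared__ float tmp[SIZE_X][SIZE_Y][NTHREADS];

    for (unsigned int i = 0; i < SIZE_X; i++)
        for (unsigned int j = 0; j < SIZE_Y; j++)
            tmp[i][j][threadIdx.x] = threadIdx.x;
}

int main()
{
    const unsigned int NTHREADS = 32;

    kernel_test_3d<NTHREADS><<<64, NTHREADS>>>();

    cudaDeviceSynchronize();
}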

Does it happen only for NBLOCKS > NTHREADS?

Apparently yes.

Hi bchr, can you update to the latest available Linux driver (310.32) and retry your test?

I just tested with the latest Linux driver (313.18) and apparently I do not see the error anymore.