CUDA racecheck, shared memory array and cudaDeviceSynchronize()

I posted this problem on Stack Overflow but never got an answer (see http://stackoverflow.com/questions/13861017/cuda-racecheck-shared-memory-array-and-cudadevicesynchronize). I may be luckier here.

This was tested on Linux with:

GPU: GeForce GT 650M
Driver Version: 313.09
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

Consider the following simple program:

template <unsigned int NTHREADS>
__global__ void kernel_test()
{
    const int SIZE_X = 4;
    const int SIZE_Y = 4;

    __shared__ float tmp[SIZE_X * SIZE_Y * NTHREADS];

    for (unsigned int i = 0; i < SIZE_X; i++)
        for (unsigned int j = 0; j < SIZE_Y; j++)
            tmp[i*blockDim.x*SIZE_Y + j*blockDim.x + threadIdx.x] = threadIdx.x;
}

int main()
{
  const unsigned int NTHREADS = 32;

  //kernel_test<NTHREADS><<<32, NTHREADS>>>(); // ---> works fine
  kernel_test<NTHREADS><<<64, NTHREADS>>>();

  cudaDeviceSynchronize(); // ---> gives racecheck errors if NBLOCKS > 32
}

This can be compiled with nvcc test.cu --ptxas-options=-v -o test

Running the program under cuda-memcheck with the racecheck tool (cuda-memcheck --tool racecheck test), I get errors depending on the number of blocks, and they seem to be triggered by cudaDeviceSynchronize().

The errors detected by the tool look like this:

========= ERROR: Potential WAW hazard detected at shared 0x6 in block (57, 0, 0) :
========= Write Thread (0, 0, 0) at 0x00000048 in …h:403:void kernel_test(void)
========= Write Thread (1, 0, 0) at 0x00000048 in …h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 128

========= INFO:(Identical data being written) Potential WAW hazard detected at shared 0x0 in block (47, 0, 0) :
========= Write Thread (32, 0, 0) at 0x00000048 in …h:403:void kernel_test(void)
========= Write Thread (0, 0, 0) at 0x00000048 in …h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 0

Let’s consider two different cases:

  • kernel_test<NTHREADS><<<32, NTHREADS>>>(); : 32 blocks, 32 threads => does not lead to any apparent racecheck error.
  • kernel_test<NTHREADS><<<64, NTHREADS>>>(); : 64 blocks, 32 threads => leads to WAW hazards (threadIdx.x = 32?!) and errors.

So what am I doing wrong here? Am I doing something wrong with shared memory?

For starters, cudaDeviceSynchronize() isn’t the cause; your kernel is. The kernel launch is asynchronous, however, so the error is only reported when you call cudaDeviceSynchronize().
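
A minimal sketch of that behaviour (using a hypothetical dummy_kernel rather than the kernel from the question): launch configuration errors show up immediately via cudaGetLastError(), while errors that happen during kernel execution only surface at the next synchronizing call.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel() { }

int main()
{
    dummy_kernel<<<64, 32>>>();

    // Problems with the launch configuration itself are visible right away.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    // Errors occurring while the kernel runs only surface at the next
    // synchronizing call, such as cudaDeviceSynchronize().
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("execution error: %s\n", cudaGetErrorString(err));

    return 0;
}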

As for the kernel, your shared memory is of size SIZE_X * SIZE_Y * NTHREADS (which in the example translates to 512 elements per block). In your nested loops you index into it with i*blockDim.x*SIZE_Y + j*blockDim.x + threadIdx.x, and this is where your problem is.

To be more specific, your i and j values will range over [0, 4), your threadIdx.x over [0, 32), and your SIZE_X and SIZE_Y values are 4.
When blockDim.x is 64, the maximum index used in the loop will be 991 (from 3*64*4 + 3*64 + 31). When blockDim.x is 32, the maximum index will be 511 (from 3*32*4 + 3*32 + 31).

Based on your code, you should get errors whenever NBLOCKS exceeds NTHREADS.
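
For reference, a small host-side sketch that reproduces the index arithmetic above (max_index is a hypothetical helper; SIZE_X and SIZE_Y mirror the kernel, and the thread index is capped at NTHREADS - 1 = 31 as described):

#include <cstdio>

// Largest linear index produced by the nested loops, i.e. with
// i = SIZE_X - 1, j = SIZE_Y - 1 and the highest thread index.
unsigned int max_index(unsigned int block_dim_x, unsigned int nthreads)
{
    const unsigned int SIZE_X = 4;
    const unsigned int SIZE_Y = 4;
    return (SIZE_X - 1) * block_dim_x * SIZE_Y
         + (SIZE_Y - 1) * block_dim_x
         + (nthreads - 1);
}

int main()
{
    printf("blockDim.x = 32 -> max index %u\n", max_index(32, 32)); // 511
    printf("blockDim.x = 64 -> max index %u\n", max_index(64, 32)); // 991
    return 0;
}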

I concur with alrikai’s analysis. @alrikai: If you are on Stack Overflow, it would be great if you could post your answer there. I would be happy to vote it up.

@njuffa That sounds good, I just did so. Thanks

Thanks, you got 10 more points now :-)

Oh I see, nice catch! And the “memcheck” tool of cuda-memcheck cannot catch shared memory errors, only “racecheck” is able to do that?

Also, regarding the code: I have been tweaking the original code to test things out.

If one considers this, which is actually closer to what I had (I do not know why I started linearizing everything, since I even added an error…):

__shared__ float tmp[SIZE_X][SIZE_Y][NTHREADS];

for (unsigned int i = 0; i < SIZE_X; i++)
    for (unsigned int j = 0; j < SIZE_Y; j++)
        tmp[i][j][threadIdx.x] = threadIdx.x;

I get the same error. Yet, shared memory is of size SIZE_X * SIZE_Y * NTHREADS, and I believe that I stay within the memory range. So what is going on here?
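
For completeness, here is a self-contained sketch of that variant (kernel_test_3d is just a name for this reconstruction; the thread count and launch parameters are the same as in the original program), which can be passed to cuda-memcheck --tool racecheck as-is:

template <unsigned int NTHREADS>
__global__ void kernel_test_3d()
{
    const int SIZE_X = 4;
    const int SIZE_Y = 4;

    // 4 * 4 * 32 floats = 2 KB of shared memory per block;
    // each thread only ever writes tmp[i][j][threadIdx.x].
    __shared__ float tmp[SIZE_X][SIZE_Y][NTHREADS];

    for (unsigned int i = 0; i < SIZE_X; i++)
        for (unsigned int j = 0; j < SIZE_Y; j++)
            tmp[i][j][threadIdx.x] = threadIdx.x;
}

int main()
{
    const unsigned int NTHREADS = 32;

    kernel_test_3d<NTHREADS><<<64, NTHREADS>>>();

    cudaDeviceSynchronize();
}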

Does it happen only for NBLOCKS > NTHREADS?

Apparently yes.

Hi bchr, can you update to the latest available Linux driver (310.32) and retry your test?

I just tested with the latest Linux driver (313.18) and apparently I do not see the error anymore.