I posted this problem on Stack Overflow but never got an answer (see http://stackoverflow.com/questions/13861017/cuda-racecheck-shared-memory-array-and-cudadevicesynchronize). I may be luckier here.
This was tested on Linux with:
GPU: GeForce GT 650M
Driver Version: 313.09
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221
Consider the following simple program:
template
__global__ void kernel_test()
{
const int SIZE_X = 4;
const int SIZE_Y = 4;
__shared__ float tmp;
for (unsigned int i = 0; i < SIZE_X; i++)
for (unsigned int j = 0; j < SIZE_Y; j++)
tmp[i*blockDim.x*SIZE_Y + j*blockDim.x + threadIdx.x] = threadIdx.x;
}
int main()
{
const unsigned int NTHREADS = 32;
//kernel_test<NTHREADS><<<32, NTHREADS>>>(); // ---> works fine
kernel_test<NTHREADS><<<64, NTHREADS>>>();
cudaDeviceSynchronize(); // ---> gives racecheck errors if NBLOCKS > 32
}
This can be compiled with nvcc test.cu --ptxas-options=-v -o test
Running the program with cuda-memcheck and the racecheck tool: cuda-memcheck --tool racecheck test, I get errors depending on the number of blocks, and this seems to be caused by cudaDeviceSynchronize().
The errors detected by the tool look like this:
[i]
========= ERROR: Potential WAW hazard detected at shared 0x6 in block (57, 0, 0) :
========= Write Thread (0, 0, 0) at 0x00000048 in …h:403:void kernel_test(void)
========= Write Thread (1, 0, 0) at 0x00000048 in …h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 128
========= INFO:(Identical data being written) Potential WAW hazard detected at shared 0x0 in block (47, 0, 0) :
========= Write Thread (32, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 0
[/i]
Let’s consider two different cases:
- kernel_test(); : 32 blocks, 32 threads => does not lead to any apparent racecheck error.
- kernel_test(); : 64 blocks, 32 threads => leads to WAW hazards (threadId.x = 32?!) and errors.
So what am I doing wrong here? Am I doing something wrong with shared memory?