Why do I get different results from these two pieces of code?

The first piece of code:

__global__ void rgb2Gray(uchar * rgbData, int * grayData, int * scaleData, int totalSize)
{
	size_t blockSize = blockDim.x * blockDim.y * blockDim.z;
	size_t uniBlockInd = blockIdx.x + blockIdx.y * gridDim.x;
	size_t threadOffset = threadIdx.x + threadIdx.y * blockDim.x
		+ threadIdx.z * blockDim.x * blockDim.y;
	size_t uniThreadInd = uniBlockInd * blockSize + threadOffset;


	if (uniThreadInd / 3 < totalSize)
	{
		grayData[uniThreadInd / 3]++;
	}
}

The second piece of code:

__global__ void rgb2Gray(uchar * rgbData, int * grayData, int * scaleData, int totalSize)
{
	size_t blockSize = blockDim.x * blockDim.y * blockDim.z;
	size_t uniBlockInd = blockIdx.x + blockIdx.y * gridDim.x;
	size_t threadOffset = threadIdx.x + threadIdx.y * blockDim.x
		+ threadIdx.z * blockDim.x * blockDim.y;
	size_t uniThreadInd = uniBlockInd * blockSize + threadOffset;


	if (uniThreadInd / 3 < totalSize && threadIdx.x == 1) {
		grayData[uniThreadInd / 3] ++;
	}

	if (uniThreadInd / 3 < totalSize && threadIdx.x == 0) {
		grayData[uniThreadInd / 3] ++;
	}

	if (uniThreadInd / 3 < totalSize && threadIdx.x == 2) {
		grayData[uniThreadInd / 3] ++;
	}
}

The main function is as shown below:

dim3 nThreadPerBlock(3, 15, 15);
dim3 nBlockPerGrid((rgbMat->cols + 14) / 15, (rgbMat->rows + 14) / 15, 1);
rgb2Gray<<<nBlockPerGrid, nThreadPerBlock>>>(dev_rgbData, dev_grayData, dev_scaleData, rgbMat->rows * rgbMat->cols);

I expect every element of grayData to equal 3, but only the second piece of code gives the right answer. Maybe the first piece of code runs into a problem such as a data race? Can someone explain this? Thanks for your help.

Replace “maybe” with “for sure” and look up CUDA atomic operations ;)
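For reference, here is a minimal sketch of what an atomic version could look like, using the same indexing and launch configuration as in the question. The names countPerPixelAtomic and countMismatchesAtomic are made up for this example, and the unused rgbData and scaleData parameters are dropped; atomicAdd itself is the standard CUDA intrinsic.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Same indexing as the kernel in the question: three consecutive threads
// (threadIdx.x = 0, 1, 2) map to the same grayData element.
__global__ void countPerPixelAtomic(int *grayData, int totalSize)
{
	size_t blockSize = blockDim.x * blockDim.y * blockDim.z;
	size_t uniBlockInd = blockIdx.x + blockIdx.y * gridDim.x;
	size_t threadOffset = threadIdx.x + threadIdx.y * blockDim.x
		+ threadIdx.z * blockDim.x * blockDim.y;
	size_t uniThreadInd = uniBlockInd * blockSize + threadOffset;

	if (uniThreadInd / 3 < (size_t)totalSize)
	{
		// atomicAdd performs the read-modify-write as one indivisible step,
		// so none of the three increments is lost.
		atomicAdd(&grayData[uniThreadInd / 3], 1);
	}
}

// Hypothetical helper (not from the question): runs the kernel for a
// cols x rows image and returns how many elements are not 3, or -1 on error.
int countMismatchesAtomic(int cols, int rows)
{
	int totalSize = cols * rows;
	int *dev_grayData = nullptr;
	if (cudaMalloc(&dev_grayData, totalSize * sizeof(int)) != cudaSuccess)
		return -1;
	cudaMemset(dev_grayData, 0, totalSize * sizeof(int));

	dim3 nThreadPerBlock(3, 15, 15);
	dim3 nBlockPerGrid((cols + 14) / 15, (rows + 14) / 15, 1);
	countPerPixelAtomic<<<nBlockPerGrid, nThreadPerBlock>>>(dev_grayData, totalSize);

	// The copy is ordered after the kernel on the default stream.
	std::vector<int> grayData(totalSize);
	cudaError_t err = cudaMemcpy(grayData.data(), dev_grayData,
		totalSize * sizeof(int), cudaMemcpyDeviceToHost);
	cudaFree(dev_grayData);
	if (err != cudaSuccess)
		return -1;

	int mismatches = 0;
	for (int value : grayData)
		if (value != 3)
			mismatches++;
	return mismatches;
}

int main()
{
	printf("elements != 3: %d\n", countMismatchesAtomic(640, 480));
	return 0;
}
```

Atomics make conflicting updates to the same address happen one at a time, so they cost some performance, but with only three threads per element the contention here should be small.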

Thanks for your reply. Now I’m sure that there is a data race in the first piece of code, but what really confuses me is that the second piece of code gets the right answer. Isn’t there a data race in the second piece of code as well? Or does CUDA automatically add some kind of barrier based on threadIdx.x?

The data race does not show up in the second version, but that is not something you can rely on. I’m not 100% sure, but I’m pretty sure the reason is that only 3 threads access the same memory location. The if statements split those three threads into separate branches, which the hardware typically executes one after another within a warp, so the last two accesses are delayed and you are “beating” the data race. I strongly suggest you change your code from the current structure though, because if the number of threads is increased at all, a data race becomes substantially more likely.

I agree with cbuchner1 that you should look up atomic operations if that is the only way you can solve your problem. Otherwise, find a more practical solution that avoids the data race without using atomic operations.
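One possible way to go around the race without atomics, as a sketch only: give each pixel a single thread that handles all three channels in a local variable and writes its result once, so no two threads ever touch the same element. The names countPerPixelNoAtomics and countMismatchesPerPixel are hypothetical, the block shape is changed to (15, 15, 1) as an assumption, and the loop body is a stand-in for whatever per-channel work the real kernel does.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per pixel: each grayData element has exactly one writer.
__global__ void countPerPixelNoAtomics(int *grayData, int totalSize)
{
	size_t blockSize = blockDim.x * blockDim.y;
	size_t uniBlockInd = blockIdx.x + blockIdx.y * gridDim.x;
	size_t pixel = uniBlockInd * blockSize
		+ threadIdx.x + threadIdx.y * blockDim.x;

	if (pixel < (size_t)totalSize)
	{
		int count = 0;
		for (int channel = 0; channel < 3; ++channel)
		{
			count++;  // stand-in for the per-channel work
		}
		grayData[pixel] = count;  // single write, no other thread involved
	}
}

// Hypothetical helper (not from the question): runs the kernel for a
// cols x rows image and returns how many elements are not 3, or -1 on error.
int countMismatchesPerPixel(int cols, int rows)
{
	int totalSize = cols * rows;
	int *dev_grayData = nullptr;
	if (cudaMalloc(&dev_grayData, totalSize * sizeof(int)) != cudaSuccess)
		return -1;
	cudaMemset(dev_grayData, 0, totalSize * sizeof(int));

	dim3 nThreadPerBlock(15, 15, 1);  // assumed: one thread per pixel
	dim3 nBlockPerGrid((cols + 14) / 15, (rows + 14) / 15, 1);
	countPerPixelNoAtomics<<<nBlockPerGrid, nThreadPerBlock>>>(dev_grayData, totalSize);

	// The copy is ordered after the kernel on the default stream.
	std::vector<int> grayData(totalSize);
	cudaError_t err = cudaMemcpy(grayData.data(), dev_grayData,
		totalSize * sizeof(int), cudaMemcpyDeviceToHost);
	cudaFree(dev_grayData);
	if (err != cudaSuccess)
		return -1;

	int mismatches = 0;
	for (int value : grayData)
		if (value != 3)
			mismatches++;
	return mismatches;
}

int main()
{
	printf("elements != 3: %d\n", countMismatchesPerPixel(640, 480));
	return 0;
}
```

The trade-off is that each thread does three times the work, but the result no longer depends on how the hardware happens to schedule the threads.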

Thank you for your suggestion. I will look up the chapter about atomic operations for more information.