Weird CUDA problem

I have a loop in a CUDA kernel as follows:

__global__ void CreateModelsfromStatsKernel(Buffers buf, const CudaFrameInfo cu)
{
    // Each thread handles one rectangle (box) of the frame.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int start_row = row * cu.rectHeight;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int start_col = col * cu.rectWidth;
    int x_offset, y_offset;
    uint32_t offset;
    uint32_t box_offset = (row * cu.numRectsX) + col;

    // Visit every pixel inside this thread's box.
    for(x_offset = 0; x_offset < cu.rectWidth; x_offset++)
    {
        for(y_offset = 0; y_offset < cu.rectHeight; y_offset++)
        {
            offset = (start_row + y_offset) * cu.fbWidth + start_col + x_offset;
            // Divide the pixel value by this box's count.
            buf.IavgF[offset] /= buf.Icount[box_offset];

            // Track the smallest value seen in this box.
            if(buf.IavgF[offset] < buf.boxMinMean[box_offset])
            {
                buf.boxMinMean[box_offset] = buf.IavgF[offset];
            }
            buf.boxMinMean[box_offset] = 37;   // debug: should always end up as 37
        }
    }
}

buf.IavgF and buf.Icount were allocated with cudaMalloc, while buf.boxMinMean was allocated with cudaMallocManaged.

In the calling host C code, after calling cudaDeviceSynchronize, I see that this line has no effect:

buf.boxMinMean[box_offset] = 37;

That is, running the kernel does NOT set this value.

Now, if I remove this line:

buf.IavgF[offset] /= buf.Icount[box_offset];

all of a sudden buf.boxMinMean[box_offset] is set to 37 as expected. What on earth is going on? I have spent most of the day trying to work out why this is not working.
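One possible explanation, not confirmed in this thread: if the launch grid is rounded up, some threads get a row or col past the last box, so offset and box_offset index beyond the end of IavgF and Icount. An out-of-bounds access makes the kernel abort with an illegal-address error, after which none of its writes (including the 37) can be relied on. Checking the cudaError_t returned by cudaDeviceSynchronize(), or running the program under compute-sanitizer, would show whether that is happening. Without the division, the conditional store to boxMinMean is immediately overwritten by 37, so the compiler may drop it together with the load of IavgF[offset], which would also remove the faulting access. The host-side C++ sketch below replays the kernel's index arithmetic for a given launch configuration. FrameInfo, Launch, the numRectsY and fbHeight fields, and the assumption that the buffers hold fbWidth*fbHeight pixels and numRectsX*numRectsY boxes are all hypothetical, since the post does not show the structs or the allocation sizes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical host-side mirror of the CudaFrameInfo fields the kernel uses.
// numRectsY and fbHeight are ASSUMED members; the post does not show the struct.
struct FrameInfo {
    int rectWidth;   // box width in pixels
    int rectHeight;  // box height in pixels
    int numRectsX;   // boxes per row
    int numRectsY;   // boxes per column (assumed)
    int fbWidth;     // frame width in pixels
    int fbHeight;    // frame height in pixels (assumed)
};

// Hypothetical launch configuration: gridDim and blockDim in x and y.
struct Launch {
    int gridX, gridY;
    int blockX, blockY;
};

// Largest IavgF index any thread would touch. The kernel's offset formula is
// non-decreasing in row, col, x_offset and y_offset, so the maximum comes from
// the last thread in the grid visiting the last pixel of its box.
int64_t maxPixelOffset(const FrameInfo& cu, const Launch& l) {
    assert(l.gridX > 0 && l.gridY > 0 && l.blockX > 0 && l.blockY > 0);
    int64_t maxRow = int64_t(l.gridY) * l.blockY - 1;
    int64_t maxCol = int64_t(l.gridX) * l.blockX - 1;
    int64_t startRow = maxRow * cu.rectHeight;
    int64_t startCol = maxCol * cu.rectWidth;
    return (startRow + cu.rectHeight - 1) * cu.fbWidth + startCol + cu.rectWidth - 1;
}

// Largest Icount/boxMinMean index any thread would touch (box_offset).
int64_t maxBoxOffset(const FrameInfo& cu, const Launch& l) {
    int64_t maxRow = int64_t(l.gridY) * l.blockY - 1;
    int64_t maxCol = int64_t(l.gridX) * l.blockX - 1;
    return maxRow * cu.numRectsX + maxCol;
}

// True if every index the unguarded kernel computes stays inside the
// (assumed) allocation sizes; also prints the numbers for inspection.
bool launchStaysInBounds(const FrameInfo& cu, const Launch& l) {
    int64_t pixels = int64_t(cu.fbWidth) * cu.fbHeight;
    int64_t boxes = int64_t(cu.numRectsX) * cu.numRectsY;
    int64_t maxPix = maxPixelOffset(cu, l);
    int64_t maxBox = maxBoxOffset(cu, l);
    std::printf("max pixel offset %lld of %lld, max box offset %lld of %lld\n",
                (long long)maxPix, (long long)pixels,
                (long long)maxBox, (long long)boxes);
    return maxPix < pixels && maxBox < boxes;
}
```

For example, a 1920x1080 frame with 8x8 boxes has 240x135 boxes. With 16x16 blocks, rounding the grid up to 9 blocks in y launches 144 box rows for 135 real ones, and launchStaysInBounds reports the launch as out of bounds. If that matches the real configuration, the usual fix is to return early in the kernel when row >= numRectsY or col >= numRectsX.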

Have I found a bug in CUDA?

Thanks.

I suggest posting CUDA programming issues in the CUDA Programming and Performance category of the NVIDIA Developer Forums.
