I have a loop in a CUDA kernel as follows:
__global__ void CreateModelsfromStatsKernel(Buffers buf, const CudaFrameInfo cu)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int start_row = row * cu.rectHeight;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int start_col = col * cu.rectWidth;
    int x_offset, y_offset;
    uint32_t offset;
    uint32_t box_offset = (row * cu.numRectsX) + col;
    for(x_offset = 0; x_offset < cu.rectWidth; x_offset++)
    {
        for(y_offset = 0; y_offset < cu.rectHeight; y_offset++)
        {
            offset = (start_row + y_offset) * cu.fbWidth + start_col + x_offset;
            buf.IavgF[offset] /= buf.Icount[box_offset];
            if(buf.IavgF[offset] < buf.boxMinMean[box_offset])
            {
                buf.boxMinMean[box_offset] = buf.IavgF[offset];
            }
            buf.boxMinMean[box_offset] = 37;
        }
    }
}
buf.IavgF and buf.Icount were allocated with cudaMalloc, while buf.boxMinMean was allocated with cudaMallocManaged.
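The allocation scheme described above can be sketched roughly as follows, with every runtime call, the launch and the synchronisation checked for errors. This is only an illustration under stated assumptions: the Buffers layout, the element types (float for IavgF and boxMinMean, uint32_t for Icount), the buffer sizes, the CUDA_CHECK macro and the stand-in WriteMarkerKernel are all hypothetical and are not taken from the real code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical helper: print the CUDA error string and exit if a call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Assumed layout; the real Buffers struct is not shown here.
struct Buffers {
    float    *IavgF;       // per-pixel data, device memory
    uint32_t *Icount;      // per-box counts, device memory
    float    *boxMinMean;  // per-box result, managed memory
};

// Stand-in kernel (not CreateModelsfromStatsKernel): one thread per box,
// guarded so that surplus threads do not write out of range.
__global__ void WriteMarkerKernel(Buffers buf, int numBoxes)
{
    int box = blockIdx.x * blockDim.x + threadIdx.x;
    if (box < numBoxes) {
        buf.boxMinMean[box] = 37.0f;
    }
}

int main()
{
    const int numPixels = 640 * 480;  // assumed frame size
    const int numBoxes  = 40 * 30;    // assumed number of rectangles

    Buffers buf;
    CUDA_CHECK(cudaMalloc(&buf.IavgF, numPixels * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&buf.Icount, numBoxes * sizeof(uint32_t)));
    CUDA_CHECK(cudaMallocManaged(&buf.boxMinMean, numBoxes * sizeof(float)));

    const int threads = 256;
    const int blocks  = (numBoxes + threads - 1) / threads;
    WriteMarkerKernel<<<blocks, threads>>>(buf, numBoxes);

    CUDA_CHECK(cudaGetLastError());       // errors in the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    // Managed memory can be read on the host once the device is synchronised.
    printf("boxMinMean[0] = %f\n", buf.boxMinMean[0]);

    CUDA_CHECK(cudaFree(buf.IavgF));
    CUDA_CHECK(cudaFree(buf.Icount));
    CUDA_CHECK(cudaFree(buf.boxMinMean));
    return 0;
}
```

Kernel execution errors are reported asynchronously, so a fault inside a kernel (for example an out-of-range access) only shows up in the return value of a later call such as cudaDeviceSynchronize. Without a check on that return value, a failed kernel can look the same from the host as a kernel whose writes simply had no effect.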
In the calling host C code, after calling cudaDeviceSynchronize, I see that this line has no effect:
buf.boxMinMean[box_offset] = 37;
i.e. calling the kernel does NOT set this value.
Now, if I remove this line:
buf.IavgF[offset] /= buf.Icount[box_offset];
All of a sudden, I see buf.boxMinMean[box_offset] set to 37 as expected. What on earth is going on? I have spent most of the day trying to work out just why CUDA is not working.
Have I found a bug in CUDA?
Thanks.