Global Memory Reading Problem

I’ve been trying to wrap my head around a problem I’m having with an image processing application in Cuda for a few days ago, but can’t make any sense of it. I’ve created a very simple set of Cuda kernels to reproduce the problem and will post them below. I’m really not just trying to copy and shift my image, but this simple sample code has the same error I’m trying to solve:

[codebox]extern “C” global void Copy_kernel(float *odata, float *idata, int width, int height)

{

__shared__ float block[BLOCK_DIM+1][BLOCK_DIM+1];



unsigned int xIndex = blockIdx.x * blockDim.x + threadIdx.x;

unsigned int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

unsigned int index_in = yIndex * width + xIndex;



// read the matrix tile into shared memory

if((xIndex < width) && (yIndex < height))

{

	block[threadIdx.y][threadIdx.x] = idata[index_in];

}

__syncthreads();

// write to global memory

if((xIndex < width) && (yIndex < height))

{

	//Write output

	odata[index_in] = block[threadIdx.y][threadIdx.x];;

}

}

extern “C” global void Shift_kernel(float *odata, float *idata, int width, int height)

{

__shared__ float block[BLOCK_DIM+1][BLOCK_DIM+1];



unsigned int xIndex = blockIdx.x * blockDim.x + threadIdx.x;

unsigned int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

unsigned int index_in = yIndex * width + xIndex;



// read the matrix tile into shared memory

if((xIndex < width - 1) && (yIndex < height - 1) && (xIndex > 1) && (yIndex > 1))

{

	block[threadIdx.y][threadIdx.x] = idata[yIndex * width + (xIndex-1)];

}



__syncthreads();



// write to global memory

if((xIndex < width) && (yIndex < height))

{

    // image update

	odata[index_in] = block[threadIdx.y][threadIdx.x]; 

}

}

[/codebox]

This is being called by a function I have in C# using the CUDA.NET library that looks like this:

[codebox] public void Next_Image(Bitmap BMP)

    {

        width = BMP.Width;

        height = BMP.Height;

h_idata = Bitmap_To_Float1D(BMP);

        d_idata = cuda.CopyHostToDevice<float>(h_idata);

        d_odata = cuda.Allocate<float>(h_idata);

//Copy Image

        cuda.SetFunctionBlockShape(fxn_Copy, 16, 16, 1);

        cuda.SetParameter(fxn_Copy, 0, (uint)d_odata.Pointer);

        cuda.SetParameter(fxn_Copy, IntPtr.Size, (uint)d_idata.Pointer);

        cuda.SetParameter(fxn_Copy, IntPtr.Size * 2, (uint)width);

        cuda.SetParameter(fxn_Copy, IntPtr.Size * 2 + 4, (uint)height);

        cuda.SetParameterSize(fxn_Copy, (uint)(IntPtr.Size * 2 + 8));

        cuda.Launch(fxn_Copy, width / 16 + 1, height / 16 + 1);

        cuda.SynchronizeContext();

//Do Shift

        cuda.SetFunctionBlockShape(fxn_Shift, 16, 16, 1);

        cuda.SetParameter(fxn_Shift, 0, (uint)d_odata.Pointer);

        cuda.SetParameter(fxn_Shift, IntPtr.Size * 1, (uint)d_odata.Pointer);

        cuda.SetParameter(fxn_Shift, IntPtr.Size * 2, (uint)width);

        cuda.SetParameter(fxn_Shift, IntPtr.Size * 2 + 4, (uint)height);

        cuda.SetParameterSize(fxn_Shift, (uint)(IntPtr.Size * 2 + 8));

        cuda.Launch(fxn_Shift, width / 16 + 1, height / 16 + 1);

        cuda.SynchronizeContext();

//Copy output data

        h_odata = new float[width * height];

        cuda.CopyDeviceToHost<float>(d_odata, h_odata);

//Clean up

        cuda.Free(d_idata);

        cuda.Free(d_odata);

    }[/codebox]

The problem is that each time I run this set of kernels serially on the image, random errors (pixels with incorrect values) pop up on the rows and columns that seem to have indexes that are a multiple of 16 (happens to be the block size). Its not the entire row or column that is wrong either, just random pixels here and there that lie on the rows/columns, and the ones that are wrong differ each time. The problem seems to come from the fact that in the Shift_kernel the threads are reading the pixel 1 column left of their designated pixel (yIndex * width + (xIndex-1)). If they read from their area then there is no error (yIndex * width + xIndex), however thats just really copying the data twice which isn’t very interesting. To see this error you also have to run both kernels serially. If you run either/or then there is no error. And just to clarify, I am running the kernels on a fresh copy of the bitmap each time (not re-running same bitmap in memory which could accumulate errors). This seems to me like it could be some memory bank error, but I’m not sure and have no idea how to fix it. One final fact of interest is that these errors show up whether I compile/run this in 32-bit mode on the Geforce 8800 GTX, or in 64-bit mode on a Tesla C1060, but there appear to be many more errors with the 64-bit Tesla combo. Thanks

I think that you can post CPU code of shift_kernel to help us what the shift-kernel does.

as far as I am concerned, I suppose that you want to shift colume of image inside an image with one row and one column

space far from boundary. You don’t need shared memory, and I modify your kernel as

extern "C" __global__ void Shift_kernel(float *odata, float *idata, int width, int height)

{	

		unsigned int xIndex = blockIdx.x * blockDim.x + threadIdx.x;	

		unsigned int yIndex = blockIdx.y * blockDim.y + threadIdx.y;	

		unsigned int index_in = yIndex * width + xIndex;

	

	// read the matrix tile into shared memory	

	

	if((xIndex < width - 1) && (yIndex < height - 1) && (xIndex > 1) && (yIndex > 1))	{		

		odata[index_in] = idata[yIndex * width + (xIndex-1)];	

	}		

}

the problem is boundary condition when doing “write” operation

Thanks for the quick reply. I tried your changes, but the same error still occurs. The C# function I posted is the code that calls both kernels serially. It doesn’t appear to be a boundary writing problem because either kernel by itself works just fine, its only when running them serially that the errors occur.

One more observation to add is that on the 880GTX/32-bit machine the worst errors are when ready from one column behind (idata[yIndex * width + (xIndex-1)]), while on the C1060/64-bit combo, the worst errors are when ready one row behind (idata[(yIndex-1) * width + xIndex]). I would love to know if this makes sense to anybody and if anyone else has had to solve this before. Thanks

Problem solved!!! This was a case of programmer error and a learning experience in multi-threaded programming. The problem was simply that the second kernel I called that was shifting the image was using the same pointer as the input and output so some cores were running thread near parts of the image that had already been shifted, and others hadn’t. In the end, make sure your write to a different output than your input if your algorithm is doing calculations on elements of your data that are dependent upon other elements of your data.