I’ve been trying to wrap my head around a problem with an image-processing application in CUDA for a few days now, but can’t make any sense of it. I’ve created a very simple pair of CUDA kernels that reproduces the problem and will post them below. My real application does more than just copy and shift an image, but this simple sample code exhibits the same error I’m trying to solve:
[codebox]extern "C" __global__ void Copy_kernel(float *odata, float *idata, int width, int height)
{
    __shared__ float block[BLOCK_DIM+1][BLOCK_DIM+1];

    unsigned int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int index_in = yIndex * width + xIndex;

    // read the matrix tile into shared memory
    if((xIndex < width) && (yIndex < height))
    {
        block[threadIdx.y][threadIdx.x] = idata[index_in];
    }

    __syncthreads();

    // write to global memory
    if((xIndex < width) && (yIndex < height))
    {
        //Write output
        odata[index_in] = block[threadIdx.y][threadIdx.x];
    }
}

extern "C" __global__ void Shift_kernel(float *odata, float *idata, int width, int height)
{
    __shared__ float block[BLOCK_DIM+1][BLOCK_DIM+1];

    unsigned int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int index_in = yIndex * width + xIndex;

    // read the matrix tile into shared memory
    if((xIndex < width - 1) && (yIndex < height - 1) && (xIndex > 1) && (yIndex > 1))
    {
        block[threadIdx.y][threadIdx.x] = idata[yIndex * width + (xIndex-1)];
    }

    __syncthreads();

    // write to global memory
    if((xIndex < width) && (yIndex < height))
    {
        // image update
        odata[index_in] = block[threadIdx.y][threadIdx.x];
    }
}
[/codebox]
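For anyone who wants to check outputs against the expected result, here is a plain-C sketch of what Shift_kernel should produce in its interior (shift_reference is a hypothetical helper of mine, not part of the real code). Note that the kernel's read guard (x > 1 and x < width - 1, same for y) is tighter than its write guard, so the outermost pixels are written from shared-memory entries that were never loaded; the reference simply leaves those untouched:

```c
#include <stddef.h>

/* CPU reference for Shift_kernel's interior behavior:
   out[y*w + x] = in[y*w + (x-1)] for 2 <= x <= w-2 and 2 <= y <= h-2,
   matching the kernel's read guard. Pixels outside that region are
   left as-is, since the kernel writes them from shared memory that
   was never loaded. */
void shift_reference(float *out, const float *in, int w, int h)
{
    for (int y = 2; y <= h - 2; ++y)
        for (int x = 2; x <= w - 2; ++x)
            out[(size_t)y * w + x] = in[(size_t)y * w + (x - 1)];
}
```

Comparing only the interior of h_odata against this reference ignores the undefined border pixels and isolates the random errors on the block-boundary rows and columns.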
These kernels are called from C# using the CUDA.NET library, by a function that looks like this:
[codebox]public void Next_Image(Bitmap BMP)
{
    width = BMP.Width;
    height = BMP.Height;
    h_idata = Bitmap_To_Float1D(BMP);
    d_idata = cuda.CopyHostToDevice<float>(h_idata);
    d_odata = cuda.Allocate<float>(h_idata);

    //Copy Image
    cuda.SetFunctionBlockShape(fxn_Copy, 16, 16, 1);
    cuda.SetParameter(fxn_Copy, 0, (uint)d_odata.Pointer);
    cuda.SetParameter(fxn_Copy, IntPtr.Size, (uint)d_idata.Pointer);
    cuda.SetParameter(fxn_Copy, IntPtr.Size * 2, (uint)width);
    cuda.SetParameter(fxn_Copy, IntPtr.Size * 2 + 4, (uint)height);
    cuda.SetParameterSize(fxn_Copy, (uint)(IntPtr.Size * 2 + 8));
    cuda.Launch(fxn_Copy, width / 16 + 1, height / 16 + 1);
    cuda.SynchronizeContext();

    //Do Shift
    cuda.SetFunctionBlockShape(fxn_Shift, 16, 16, 1);
    cuda.SetParameter(fxn_Shift, 0, (uint)d_odata.Pointer);
    cuda.SetParameter(fxn_Shift, IntPtr.Size * 1, (uint)d_odata.Pointer);
    cuda.SetParameter(fxn_Shift, IntPtr.Size * 2, (uint)width);
    cuda.SetParameter(fxn_Shift, IntPtr.Size * 2 + 4, (uint)height);
    cuda.SetParameterSize(fxn_Shift, (uint)(IntPtr.Size * 2 + 8));
    cuda.Launch(fxn_Shift, width / 16 + 1, height / 16 + 1);
    cuda.SynchronizeContext();

    //Copy output data
    h_odata = new float[width * height];
    cuda.CopyDeviceToHost<float>(d_odata, h_odata);

    //Clean up
    cuda.Free(d_idata);
    cuda.Free(d_odata);
}[/codebox]
The problem is that each time I run this pair of kernels serially on the image, random errors (pixels with incorrect values) pop up on the rows and columns whose indexes are a multiple of 16 (which happens to be the block size). It's not the entire row or column that is wrong, either: just random pixels here and there along those rows/columns, and the ones that are wrong differ on each run.

The problem seems to come from the fact that in Shift_kernel the threads read the pixel one column to the left of their designated pixel (yIndex * width + (xIndex-1)). If each thread reads from its own location (yIndex * width + xIndex) there is no error, but then I'm just copying the data twice, which isn't very interesting. To see the error you also have to run both kernels serially; if you run either one alone there is no error. And just to clarify, I am running the kernels on a fresh copy of the bitmap each time (not re-running the same bitmap in memory, which could accumulate errors).

This seems to me like it could be some memory bank error, but I'm not sure and have no idea how to fix it. One final fact of interest: these errors show up whether I compile/run in 32-bit mode on the GeForce 8800 GTX or in 64-bit mode on a Tesla C1060, but there appear to be many more errors with the 64-bit Tesla combo. Thanks
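One thing I do notice in Next_Image is that the Shift launch passes d_odata as both the output and input parameter, so Shift_kernel reads and writes the same buffer. I'm not sure whether that is the cause, but here is a plain-C sketch of my own (not the actual GPU scheduling, and hypothetical function names) of why an in-place neighbor shift is order-dependent, which would fit errors appearing only at block boundaries, where one block reads pixels that another block writes:

```c
/* In-place left-neighbor shift, b[i] = b[i-1], done over the SAME
   buffer in two different traversal orders. When reads and writes
   alias, the result depends on which elements were updated first --
   the analogue of CUDA blocks finishing in an unpredictable order. */

static void shift_ascending(float *b, int n)
{
    for (int i = 1; i < n; ++i)
        b[i] = b[i - 1];   /* each read sees an already-overwritten value */
}

static void shift_descending(float *b, int n)
{
    for (int i = n - 1; i >= 1; --i)
        b[i] = b[i - 1];   /* each read still sees the original value */
}
```

Starting from {0, 1, 2, 3}, the ascending pass smears the first element everywhere ({0, 0, 0, 0}) while the descending pass produces the intended shift ({0, 0, 1, 2}). Within one block __syncthreads() separates the reads from the writes, but across block edges there is no such ordering, so a thread reading one column to the left may pick up a value its neighbor block has or hasn't already shifted.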