Problem with 1D buffer data processing; I need help from CUDA experts.

Using a GeForce GT 630 on a quad-core i5, I'm trying to process a 1920x1080 video frame captured from a camera. The converted YUV data is passed to the function "xCopyToImageBuffer", which converts the char elements to integers before the main processing runs on the data. At the moment I'm just trying to familiarize myself with CUDA parallel programming, so I wrote the following code to copy the YUV data (one YUV component at a time), convert the image chars to integers, process the integer data, copy it back to the buffer as chars, and finally write it to a file as YUV. This works fine on the CPU with a nested for loop in the .cpp file that copies the data, one element at a time, as integers from picSrc to ImageBuffer.
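For reference, the CPU path described above might look like the following minimal sketch. The original loop was not posted, so the function name copyToImageBufferCpu and the tightly packed, row-major layout of a single component plane are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not the posted code): widen each 8-bit sample
// of one YUV component plane to an unsigned int.
// Assumes a tightly packed, row-major plane of width * height samples.
std::vector<unsigned int> copyToImageBufferCpu(const unsigned char* picSrc,
                                               int width, int height)
{
    std::vector<unsigned int> imageBuffer(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            const std::size_t i = static_cast<std::size_t>(y) * width + x;
            imageBuffer[i] = picSrc[i]; // value-preserving widening, range 0..255
        }
    }
    return imageBuffer;
}
```

A reference like this is also a convenient way to check the GPU result element by element once the kernel runs.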

The target frame resolution is Full HD (1920x1080), but the function "xCopyToImageBuffer" returns nothing when the frame resolution exceeds 320x240, and even at that resolution or lower the data is not correct.

Based on the system features:

The frame is 1920x1080 = 2073600 elements
Compute capability 2.1 means a block can have up to 1024 threads

Therefore:

dimBlock ( blocksize ); // block size = dimBlock.x = total number of threads per block
dimGrid ( ceil(byteCount / (float)blocksize)); // total number of blocks = dimGrid.x = 2025 blocks
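The block count above can be double-checked on the host with integer arithmetic, which avoids the float conversion inside ceil(). This is only a sketch: blocksFor is a hypothetical helper, not part of the posted code.

```cpp
#include <cassert>

// Hypothetical helper: smallest number of blocks such that
// blocks * threadsPerBlock >= elementCount, using integers only.
unsigned int blocksFor(unsigned int elementCount, unsigned int threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    // Full blocks, plus one partial block if there is a remainder.
    return elementCount / threadsPerBlock
         + (elementCount % threadsPerBlock != 0 ? 1u : 0u);
}
```

For a 1920x1080 plane and 1024 threads per block, blocksFor(1920u * 1080u, 1024u) gives 2025, matching the figure above.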

And the number of blocks here does not exceed the per-dimension limit of a grid (65535), so I assume there is no need to structure the grid differently, but I'm still not sure where I'm going wrong in the following code.

///////////////////////////////////////////////////////////////////////////////////////////////////
.cu file
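__global__ void CpyToBufferKernel (unsigned int* pDst, unsigned char* pSrc, unsigned int byteCount)
{
    // One thread per element; the guard skips threads past the end of the buffer.
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < byteCount)
    {
        pDst[idx] = (unsigned int)pSrc[idx];
    }
}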

extern "C" void xCopyToImageBuff(
    UInt* pDst,
    UChar* pSrc,
    UInt byteCount
)
{
    int blocksize = 1024;
    dim3 dimBlock ( blocksize );
    dim3 dimGrid ( ceil(byteCount / (float)blocksize));

    CpyToBufferKernel<<< dimGrid, dimBlock >>>(pDst, pSrc, byteCount );
    cudaDeviceSynchronize(); // replaces the deprecated cudaThreadSynchronize()
}

///////////////////////////////////////////////////////////////////////////////////////////////////
.
.
.
///////////////////////////////////////////////////////////////////////////////////////////////////

.cpp file

void xCopyToImageBuffer(unsigned char* picSrc, int width, int height){

    unsigned int* ImageBuffer;
    unsigned int* d_Data_Out;
    unsigned char* d_Data_In;

    unsigned int byteCount = width * height;
    ImageBuffer = new unsigned int[byteCount];

    cudaMalloc((void **)&d_Data_In, byteCount);
    cudaMalloc((void **)&d_Data_Out, byteCount);

    cudaMemcpy(d_Data_In, picSrc, byteCount, cudaMemcpyHostToDevice);
    xCopyToImageBuff(d_Data_Out, d_Data_In, byteCount);
    cudaDeviceSynchronize();
    cudaMemcpy (ImageBuffer, d_Data_Out, byteCount, cudaMemcpyDeviceToHost);

    cudaFree(d_Data_In);
    cudaFree(d_Data_Out);
    cudaDeviceReset();

    //Process ImageBuffer Here and copy data back
    //…
    //…
    //…
    //convert the data back and save the processed YUV data into a file
}

///////////////////////////////////////////////////////////////////////////////

Please, I need help with this; I have been struggling with it for almost a week now.

I have moved your topic to the appropriate forum.

It looks to me like d_Data_Out is too small, so your kernel will fault on the pDst[idx] assignment.

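You declare d_Data_Out as a pointer to unsigned int, but cudaMalloc() takes its size in bytes. Multiply byteCount by sizeof(unsigned int) when allocating d_Data_Out; the device-to-host cudaMemcpy() into ImageBuffer needs the same size.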
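The size mismatch can be illustrated on the host without any CUDA calls. In this minimal sketch, bytesFor and elementsThatFit are hypothetical helpers introduced only for illustration:

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed to hold elementCount items of type T.
template <typename T>
std::size_t bytesFor(std::size_t elementCount)
{
    return elementCount * sizeof(T);
}

// Hypothetical helper: how many whole items of type T fit in allocBytes bytes.
template <typename T>
std::size_t elementsThatFit(std::size_t allocBytes)
{
    return allocBytes / sizeof(T);
}
```

On a platform where unsigned int is 4 bytes, elementsThatFit<unsigned int>(byteCount) is only a quarter of byteCount, so most threads would write past the end of the allocation. Passing bytesFor<unsigned int>(byteCount) as the size for d_Data_Out, and for the copy back to the host, would cover every element.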

Thanks Jeff Davis and thanks Allanmac, the problem is solved and it works fine now.