Problem converting a 1D linear array (chars to integers)

Using a GeForce GT 630 with an i5 quad core, I'm trying to process 1920x1080 video frames captured from a camera. The converted YUV data is passed to the function xCopyToImageBuffer, which converts the char
elements to integers before the main processing of this data. At the moment I'm
just trying to familiarize myself with CUDA parallel programming, so I have written the following
code to copy the YUV data (processing one YUV component at a time), convert the image chars to integers,
process the integer data, copy it back to the buffer as chars, and finally write it to a file as YUV.
This works fine with a nested for loop copying the data as integers one by one from picSrc to ImageBuffer in the .cpp file.

The target frame resolution is Full HD (1920x1080), but the function xCopyToImageBuffer returns nothing if the
frame resolution exceeds 320x240, and even at that resolution or lower the data is not correct.

Based on the system features:

The frame is 1920x1080 = 2073600 elements.
Compute capability 2.1 means I can have up to 1024 threads per block.


dimBlock ( blocksize ); // block size = dimBlock.x = total number of threads per block
dimGrid ( ceil(byteCount / (float)blocksize)); // total number of blocks = dimGrid.x = 2025 blocks

The number of blocks here does not exceed the per-grid limit (65535), so I assume there is no need to structure the grid differently,
but I'm still not sure where I am wrong in the following code.

.cu file

__global__ void CpyToBufferKernel(unsigned int* pDst, unsigned char* pSrc, unsigned int byteCount)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < byteCount)
        pDst[idx] = (int)pSrc[idx];
}

extern "C" void xCopyToImageBuff(
    UInt*  pDst,
    UChar* pSrc,
    UInt   byteCount)
{
    int blocksize = 1024;
    dim3 dimBlock ( blocksize );
    dim3 dimGrid ( ceil(byteCount / (float)blocksize));

    CpyToBufferKernel<<< dimGrid, dimBlock >>>(pDst, pSrc, byteCount );
}


.cpp file

void xCopyToImageBuffer(unsigned char* picSrc, int width, int height){

unsigned int* ImageBuffer;
unsigned int* d_Data_Out;
unsigned char* d_Data_In;

unsigned int byteCount = width * height;
ImageBuffer = new unsigned int[byteCount];

cudaMalloc((void **)&d_Data_In, byteCount);
cudaMalloc((void **)&d_Data_Out, byteCount);

cudaMemcpy(d_Data_In, picSrc, byteCount, cudaMemcpyHostToDevice);
xCopyToImageBuff(d_Data_Out, d_Data_In, byteCount);
cudaMemcpy (ImageBuffer, d_Data_Out, byteCount, cudaMemcpyDeviceToHost);


    //Process ImageBuffer here and copy the data back
    //convert the data back and save the processed YUV data into a file
}
Please, I need help with this; I have been struggling with it for almost a week now.

Problem solved, in the .cpp file:

cudaMalloc((void **)&d_Data_In, byteCount * sizeof(unsigned char ));
cudaMalloc((void **)&d_Data_Out, byteCount * sizeof(unsigned int ));

cudaMemcpy(d_Data_In, picSrc, byteCount* sizeof(unsigned char), cudaMemcpyHostToDevice);
xCopyToImageBuff(d_Data_Out, d_Data_In, byteCount);
cudaMemcpy (ImageBuffer, d_Data_Out, byteCount * sizeof(unsigned int), cudaMemcpyDeviceToHost);
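For what it's worth, the silent failure at higher resolutions would have shown up much sooner with error checking after each CUDA call. A minimal sketch (the checkCuda helper is hypothetical, not part of the CUDA API; cudaGetLastError, cudaDeviceSynchronize, and cudaGetErrorString are real runtime calls):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print a message and abort on any CUDA runtime error
static void checkCuda(cudaError_t err, const char* msg)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Usage around the calls above:
// checkCuda(cudaMalloc((void**)&d_Data_Out, byteCount * sizeof(unsigned int)), "cudaMalloc d_Data_Out");
// CpyToBufferKernel<<< dimGrid, dimBlock >>>(pDst, pSrc, byteCount);
// checkCuda(cudaGetLastError(), "kernel launch");
// checkCuda(cudaDeviceSynchronize(), "kernel execution");
```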


You would probably like to change the kernel and calling code to process a uchar4 and a float4
at once instead of char to float. That should result in better performance of the copy kernel.
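A rough sketch of that idea, assuming byteCount is a multiple of 4 (tail handling omitted, and I've kept the integer output type rather than float); each thread loads one uchar4 and writes one uint4:

```cuda
__global__ void CpyToBufferKernel4(uint4* pDst, const uchar4* pSrc, unsigned int elemCount4)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < elemCount4) {
        uchar4 v = pSrc[idx];                        // one 4-byte load
        pDst[idx] = make_uint4(v.x, v.y, v.z, v.w);  // one 16-byte store
    }
}

// Launch with a quarter of the threads:
// unsigned int elemCount4 = byteCount / 4;  // assumes byteCount % 4 == 0
// dim3 dimGrid((elemCount4 + blocksize - 1) / blocksize);
// CpyToBufferKernel4<<< dimGrid, dimBlock >>>((uint4*)d_Data_Out,
//                                             (const uchar4*)d_Data_In, elemCount4);
```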