Conversion from RGB to gray scale using CUDA

Hello,
can somebody help? I am writing code to loop through every pixel of an RGB image and convert it to grayscale, but it doesn't work. It compiles, but the results are not as expected.

I have defined
#define DIM 512
dim3 grid(DIM,DIM);

and my kernel call looks like this:

gpuConvert <<<grid, 1>>>(dev_ptrout, dev_ptrin);

My function looks like this:

__global__ void gpuConvert(unsigned char *b, const unsigned char *a)
{
// Calculate the index of the pointers based on pixel location for each thread
int x = blockIdx.x; // current column
int y = blockIdx.y; // current row

int dim= gridDim.x;

// Perform the conversion
int index = x + y*dim;       // current pixel index

b[index] = (a[index] + a[index+1] + a[index+2]) / 3.0f;

}

Am I missing something?
Please help! I am lost

Apparently your input pixel format is RGB unsigned char. That means you have to account for the 3 bytes per pixel in your pointer calculations.

What about your output pixel format? Is it also RGB 3 bytes unsigned char? Or something else?
If it is a 3-byte output format, something like this might work:

int index = x + y*dim;  // current pixel index
int iindex = 3*index;   // compute byte offset
int oindex = 3*index;
unsigned char temp = (a[iindex]+ a[iindex+1]+ a[iindex+2])/3.0f;

b[oindex] = temp; // assuming 3 byte output format
b[oindex+1] = temp; // assuming 3 byte output format
b[oindex+2] = temp; // assuming 3 byte output format

if it’s a 1-byte output format, something like this:

int index = x + y*dim;  // current pixel index
int iindex = 3*index;   // compute byte offset
int oindex = index;
unsigned char temp = (a[iindex]+ a[iindex+1]+ a[iindex+2])/3.0f;

b[oindex] = temp; // assuming 1 byte output format
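Putting that together, the whole kernel for the 1-byte output case might look like this. This is a sketch only, assuming a tightly packed 3-bytes-per-pixel RGB input, a square DIM x DIM image, and your existing `<<<grid, 1>>>` launch:

```cuda
// Sketch: 3-byte RGB input, 1-byte grayscale output.
// Assumes the image is DIM x DIM and the kernel is launched as
// gpuConvert<<<dim3(DIM, DIM), 1>>>(dev_ptrout, dev_ptrin);
__global__ void gpuConvert(unsigned char *b, const unsigned char *a)
{
    int x = blockIdx.x;         // current column
    int y = blockIdx.y;         // current row
    int dim = gridDim.x;

    int index  = x + y * dim;   // current pixel index
    int iindex = 3 * index;     // input byte offset (3 bytes per pixel)

    // Simple average of the three channels.
    b[index] = (a[iindex] + a[iindex + 1] + a[iindex + 2]) / 3;
}
```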

There’s a lot of other comments that could be made, but keep it simple until you get your code to work.

A quick question: I used the 1-byte output you suggested. It compiles, but the image appears gray only at the top (about 1/4 of the window) and the rest looks distorted…
When this happens, what should I look at first: the statements inside the GPU kernel, the kernel call (changing the parameters inside the <<<>>>), or something else?
I am lost here!

Based on the code you sent me privately, I would look at the sizes of the arrays you are allocating.

Thanks
You mean to change the sizeof(unsigned char) to sizeof(something else)?
I already tried this, if that is what you mean.
:(

Right now you have some code like this in your convertwithCUDA function:

// Allocate the input pointer on the device.
cudaStatus = cudaMalloc( (void**) & dev_ptrin, sizeof(unsigned char) * input.rows * input.cols);

// Allocate the output pointer on the device.
cudaStatus = cudaMalloc( (void**) & dev_ptrout, sizeof(unsigned char) * grayImage.rows * grayImage.cols);

Does it make sense to use sizeof(unsigned char) for both?

How many unsigned char does it take to represent a pixel in the input image?
How many unsigned char does it take to represent a pixel in the output image?

Even though you made the change in the kernel, you’re not really grasping that the input image has 3 bytes per pixel and the output image has 1 byte per pixel.

And you’ll need to modify your cudaMemcpy operations accordingly, as well.
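For example (just a sketch, assuming a 3-channel 8-bit input Mat and a 1-channel 8-bit output Mat, reusing your variable names), the allocations and copies might become:

```cuda
// Input: 3 unsigned chars per pixel, so multiply by 3.
cudaStatus = cudaMalloc((void**)&dev_ptrin,
                        3 * sizeof(unsigned char) * input.rows * input.cols);

// Output: 1 unsigned char per pixel.
cudaStatus = cudaMalloc((void**)&dev_ptrout,
                        sizeof(unsigned char) * grayImage.rows * grayImage.cols);

// Copy the full 3-bytes-per-pixel input to the device...
cudaStatus = cudaMemcpy(dev_ptrin, input.data,
                        3 * sizeof(unsigned char) * input.rows * input.cols,
                        cudaMemcpyHostToDevice);

// ...and, after the kernel runs, copy the 1-byte-per-pixel result back.
cudaStatus = cudaMemcpy(grayImage.data, dev_ptrout,
                        sizeof(unsigned char) * grayImage.rows * grayImage.cols,
                        cudaMemcpyDeviceToHost);
```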

I tried to change the sizeof from char to long in the following statements:

__global__ void gpuConvert(unsigned char *b, const unsigned long *a)
only the second parameter here

unsigned long temp = (a[offset]+ a[offset+1]+ a[offset+2])/3.0f;
this statement is inside the kernel

unsigned long *dev_ptrin;

cudaStatus = cudaMalloc( (void**) & dev_ptrin, sizeof(unsigned long ) * input.rows * input.cols);

cudaStatus = cudaMemcpy(dev_ptrin, input.data, sizeof(unsigned long ) * input.rows * input.cols, cudaMemcpyHostToDevice);

but it doesn’t work. I probably (more likely) am not following your exact directions… that’s why. But as I said, I am a self-learner, and I am already learning a lot from you…

OK, should I change it to float or int?
Also, should I change the following
Mat gpuConvertedImage (inputImage.rows, inputImage.cols, CV_8UC1)
to
Mat gpuConvertedImage (inputImage.rows, inputImage.cols, CV_32FC1)
?

Any more help greatly appreciated.
Or, does anyone know a good link to start with regarding my issue, to make everything clearer?

Try this blog post. It includes a completely worked sample code for what you are trying to do.

http://www.programmerfish.com/how-to-write-a-custom-cuda-kernel-with-opencv-as-host-library/

Thank you. I will check it.
:)