unsigned char pointer in a kernel

chianli · March 16, 2009, 6:36pm

hi,

i am pretty confused because my code works fine in emulation mode but not in GPU mode…

and i apologize if my problem could look like a very newbie one

here a sample of my code which should transform a 32bit coulored image (the bip pointer) in a 8bit grey scale image (bop) :

post function is like the sprintf

extern "C" void cu_hello(t_cu_jit_hello *x, long dimcount, long *dim, long planecount, 

	t_jit_matrix_info *in_minfo, char *bip, t_jit_matrix_info *out_minfo, char *bop);

// Kernel that executes on the CUDA device

__global__ void rgb2luma_kernel(unsigned char *a, unsigned char *b, long *indimstride, long *outdimstride, long *dim, long ascale, long rscale, long gscale, long bscale)

{

	//int i = blockIdx.x * blockDim.x + threadIdx.x;

	//int j = blockIdx.y * blockDim.y + threadIdx.y;

	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	int i = idx%dim[0];

	int j = idx/dim[0];

	if ( j < dim[1] )

	{

		if ( i < dim[0] )

		{

			a += i*indimstride[0] + j*indimstride[1];

			b += i + j*outdimstride[1]; //assumes that b is a one plane 2D matrix

			*b = (unsigned char) ((*a * ascale + *(a+1) * rscale + *(a+2) * gscale + *(a+3) * bscale)/255);

		}

	}

}

// main routine that executes on the host

void cu_hello(t_cu_jit_hello *x, long dimcount, long *dim, long planecount, 

	t_jit_matrix_info *in_minfo, char *bip, t_jit_matrix_info *out_minfo, char *bop)

{

	unsigned char *bip_d, *bop_d;  // Pointer to host & device arrays

	const int insize  = in_minfo->dimstride[1]*dim[1];  // Number of elements in input arrays

	const int outsize = out_minfo->dimstride[1]*dim[1]; // Number of elements in output arrays

	long w=dim[0], h=dim[1];

	double fascale = x->ascale;

	double frscale = x->rscale;

	double fgscale = x->gscale;

	double fbscale = x->bscale;

	long ascale = fascale * 65536.;

	long rscale = frscale * 65536.;

	long gscale = fgscale * 65536.;

	long bscale = fbscale * 65536.;

	dim3 dimBlock(x->dimBlock[0], x->dimBlock[1]); // number of threads in one block

	dim3 dimGrid(w/dimBlock.x + (w%dimBlock.x == 0 ? 0:1),h/dimBlock.x + (h%dimBlock.x == 0 ? 0:1)); // number of blocks per grid

	// Allocate array on device

	if ( cudaSuccess != cudaMalloc((void **) &bip_d, insize*sizeof(unsigned char)) 

		|| cudaSuccess != cudaMalloc((void **) &bop_d, outsize*sizeof(unsigned char)))

	{

		post("CUDA can't allocated device memory.");

		goto out;

	}

	if ( cudaSuccess != cudaMemcpy(bip_d, (unsigned char *) bip, insize*sizeof(unsigned char), cudaMemcpyHostToDevice) )

	{

		post("CUDA can't copy data from host to device");

		goto out;

	}

	// Do calculation on device:

	rgb2luma_kernel <<< dimGrid, dimBlock >>> (bip_d, bop_d, in_minfo->dimstride, out_minfo->dimstride, dim, ascale, rscale, gscale, bscale);

	// Retrieve result from device and store it in host array

	if (cudaSuccess != cudaMemcpy(bop, (char *) bop_d, outsize*sizeof(unsigned char), cudaMemcpyDeviceToHost))

	{

		post("CUDA can't copy data from host to device");

		goto out;

	}

out:

	// Cleanup

	if ( cudaSuccess != cudaFree(bip_d) || cudaSuccess != cudaFree(bop_d)) post("CUDA can't free device pointer.");

}

in normal mode (no emulation), when i assign a value to the b pointer in the global function, the programm crashes…

whereas it works fine in emulation mode

should I pass the pointer adress in another way to the global function ?

many thanks for your help and fell free to contact me for any precision

best

ChianLi

chianli · March 16, 2009, 9:09pm

To precise some little thing, i noticed that it seems to be a problem in my kernel because it got an error when i tried to copy data back from device to host with cudaMemcpy(bop, (char ) bop_d, outsizesizeof(unsigned char), cudaMemcpyDeviceToHost) after passing throught the kernel whereas if i bypass the kernel i can read back the data to host

so i still wonder where is my mistake ?
i am sure it’s a huge newbie mistake but i can’t find it…

please help me !

thanks a lot

ChianLi

chianli · March 16, 2009, 9:43pm

it seems that as soon as i modify the bop_d value in the kernel, the pointer became corrupted and then i can’t get it back to the host
also i can’t free device and host pointers when bop_d is corrupted

any idea ?

thummel · March 17, 2009, 3:48am

I’m pretty sure it’s in the main loop body.

Try changing

[codebox]

a += iindimstride[0] + jindimstride[1];

b += i + j*outdimstride[1]; //assumes that b is a one plane 2D matrix

*b = (unsigned char) ((*a * ascale + *(a+1) * rscale + *(a+2) * gscale + *(a+3) * bscale)/255);

[/codebox]

To

[codebox]

unsigned char* a_idx = &a[i*indimstride[0] + j*indimstride[1]];

unsigned char* b_idx = &b[i + j*outdimstride[1]]; //assumes that b is a one plane 2D matrix

*b_idx = (unsigned char) ((*a_idx * ascale + *(a_idx+1) * rscale + *(a_idx+2) * gscale + *(a_idx+3) * bscale)/255);

[/codebox]

The reason I think this might be your problem is that the parameters that are passed to the kernel function are put into memory that is shared between the threads (at least, that would make the most sense). Your “a +=” and “b +=” lines are therefore changing shared memory that is used in many different threads (creating possible data hazards all over the place).

Anyway, that’s what jumped out at me…if that’s not it I’d have to look closer.

chianli · March 17, 2009, 11:24am

hi Thummel,

thanks for your reply, but it still doesn’t work

i was wondering if the problem might be due to shared memory and try for example this :

b[i + j*outdimstride[1]] = (unsigned char) 255;

or even simpler :

*b = (unsigned char) 255;

instead of the 3 lines but it doesn’t work too

so it might not be a problem of pointing an unallocated memory area ?

but what else ? a memory allocation problem ? the pointer doesn’t have the expected size ?

if you have a little bit more time to help i will appreciate!

thanks again

ChianLi

thummel · March 18, 2009, 2:28am

Okay, I took a second look at it and it looks like you’re passing a bunch of pointers to host data to the kernel function. Unless there is some way that I don’t know of, anything that you are going to pass (all parameters) to the kernel needs to have space created for it with cudaMalloc and, if you care about the value of the data entering the kernel, the data copied to it using cudaMemcpy.

So, the reason it is crashing is that the access to dim[0] on the second line of the kernel is illegal and is crashing the program. It works in emulation mode because there is no separation between host and device memory spaces in emulation mode.

In fact, the accesses to dim, indimstride, and outdimstride are all illegal, and the accesses to ascale, rscale, gscale, and bscale probably are as well (I’m not sure if you can pass ints directly to a kernel function… I’ve never tried).

Once again…this is just what jumped out at me tonight…ask again tomorrow if it still doesn’t work :)

chianli · March 18, 2009, 3:43am

many many many thanks !
it’s pretty obvious now !

i was searching my mistake for one week !
and i had just identify that the issue might come from the arguments of the kernel call when i saw your answer

thanks a lot, i would kiss you if i could

and yes i can pass ints directly to the kernel and this put me on the wrong way…

thanks again !

best

ChianLi

Topic		Replies	Views
program crash when copying from device to host <br /> CUDA Programming and Performance	11	1877	March 31, 2009
Pinned memory error invalid device pointer CUDA Programming and Performance	9	6080	April 10, 2009
Newbie problem with copying between host and device My image is just gr{a\|e}y CUDA Programming and Performance	8	5274	May 30, 2011
pointer in global device memory CUDA Programming and Performance	9	11592	November 23, 2011
most general form for thread access? CUDA Programming and Performance	10	2531	February 21, 2010
Unable to do image processing algorithms on GPU CUDA Programming and Performance cuda , kernel , gpu	5	757	November 20, 2022
Performance issues on memory transfer CUDA Programming and Performance	13	12982	November 26, 2010
cudaBindTexture2D Problem? can't find the reason for this bahavior CUDA Programming and Performance	2	1937	August 19, 2009
CUDA and char* programming CUDA Programming and Performance	8	14957	July 31, 2008
Copying 2D array from host to device CUDA Programming and Performance	7	7244	July 27, 2010

unsigned char pointer in a kernel

Related topics