unsigned char pointer in a kernel

hi,

i am pretty confused because my code works fine in emulation mode but not in GPU mode…

and i apologize if my problem could look like a very newbie one

here a sample of my code which should transform a 32bit coulored image (the bip pointer) in a 8bit grey scale image (bop) :

post function is like the sprintf

extern "C" void cu_hello(t_cu_jit_hello *x, long dimcount, long *dim, long planecount, 

	t_jit_matrix_info *in_minfo, char *bip, t_jit_matrix_info *out_minfo, char *bop);

// Kernel that executes on the CUDA device

__global__ void rgb2luma_kernel(unsigned char *a, unsigned char *b, long *indimstride, long *outdimstride, long *dim, long ascale, long rscale, long gscale, long bscale)

{

	//int i = blockIdx.x * blockDim.x + threadIdx.x;

	//int j = blockIdx.y * blockDim.y + threadIdx.y;

	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	int i = idx%dim[0];

	int j = idx/dim[0];

	if ( j < dim[1] )

	{

		if ( i < dim[0] )

		{

			a += i*indimstride[0] + j*indimstride[1];

			b += i + j*outdimstride[1]; //assumes that b is a one plane 2D matrix

			*b = (unsigned char) ((*a * ascale + *(a+1) * rscale + *(a+2) * gscale + *(a+3) * bscale)/255);

		}

	}

}

// main routine that executes on the host

void cu_hello(t_cu_jit_hello *x, long dimcount, long *dim, long planecount, 

	t_jit_matrix_info *in_minfo, char *bip, t_jit_matrix_info *out_minfo, char *bop)

{

	unsigned char *bip_d, *bop_d;  // Pointer to host & device arrays

	const int insize  = in_minfo->dimstride[1]*dim[1];  // Number of elements in input arrays

	const int outsize = out_minfo->dimstride[1]*dim[1]; // Number of elements in output arrays

	long w=dim[0], h=dim[1];

	double fascale = x->ascale;

	double frscale = x->rscale;

	double fgscale = x->gscale;

	double fbscale = x->bscale;

	long ascale = fascale * 65536.;

	long rscale = frscale * 65536.;

	long gscale = fgscale * 65536.;

	long bscale = fbscale * 65536.;

	dim3 dimBlock(x->dimBlock[0], x->dimBlock[1]); // number of threads in one block

	dim3 dimGrid(w/dimBlock.x + (w%dimBlock.x == 0 ? 0:1),h/dimBlock.x + (h%dimBlock.x == 0 ? 0:1)); // number of blocks per grid

	// Allocate array on device

	if ( cudaSuccess != cudaMalloc((void **) &bip_d, insize*sizeof(unsigned char)) 

		|| cudaSuccess != cudaMalloc((void **) &bop_d, outsize*sizeof(unsigned char)))

	{

		post("CUDA can't allocated device memory.");

		goto out;

	}

	if ( cudaSuccess != cudaMemcpy(bip_d, (unsigned char *) bip, insize*sizeof(unsigned char), cudaMemcpyHostToDevice) )

	{

		post("CUDA can't copy data from host to device");

		goto out;

	}

	// Do calculation on device:

	rgb2luma_kernel <<< dimGrid, dimBlock >>> (bip_d, bop_d, in_minfo->dimstride, out_minfo->dimstride, dim, ascale, rscale, gscale, bscale);

	// Retrieve result from device and store it in host array

	if (cudaSuccess != cudaMemcpy(bop, (char *) bop_d, outsize*sizeof(unsigned char), cudaMemcpyDeviceToHost))

	{

		post("CUDA can't copy data from host to device");

		goto out;

	}

out:

	// Cleanup

	if ( cudaSuccess != cudaFree(bip_d) || cudaSuccess != cudaFree(bop_d)) post("CUDA can't free device pointer.");

}

in normal mode (no emulation), when i assign a value to the b pointer in the global function, the programm crashes…

whereas it works fine in emulation mode

should I pass the pointer adress in another way to the global function ?

many thanks for your help and fell free to contact me for any precision

best

ChianLi

To precise some little thing, i noticed that it seems to be a problem in my kernel because it got an error when i tried to copy data back from device to host with cudaMemcpy(bop, (char ) bop_d, outsizesizeof(unsigned char), cudaMemcpyDeviceToHost) after passing throught the kernel whereas if i bypass the kernel i can read back the data to host

so i still wonder where is my mistake ?
i am sure it’s a huge newbie mistake but i can’t find it…

please help me !

thanks a lot

ChianLi

it seems that as soon as i modify the bop_d value in the kernel, the pointer became corrupted and then i can’t get it back to the host
also i can’t free device and host pointers when bop_d is corrupted

any idea ?

I’m pretty sure it’s in the main loop body.

Try changing

[codebox]

a += iindimstride[0] + jindimstride[1];

b += i + j*outdimstride[1]; //assumes that b is a one plane 2D matrix

*b = (unsigned char) ((*a * ascale + *(a+1) * rscale + *(a+2) * gscale + *(a+3) * bscale)/255);

[/codebox]

To

[codebox]

unsigned char* a_idx = &a[i*indimstride[0] + j*indimstride[1]];

unsigned char* b_idx = &b[i + j*outdimstride[1]]; //assumes that b is a one plane 2D matrix

*b_idx = (unsigned char) ((*a_idx * ascale + *(a_idx+1) * rscale + *(a_idx+2) * gscale + *(a_idx+3) * bscale)/255);

[/codebox]

The reason I think this might be your problem is that the parameters that are passed to the kernel function are put into memory that is shared between the threads (at least, that would make the most sense). Your “a +=” and “b +=” lines are therefore changing shared memory that is used in many different threads (creating possible data hazards all over the place).

Anyway, that’s what jumped out at me…if that’s not it I’d have to look closer.

hi Thummel,

thanks for your reply, but it still doesn’t work

i was wondering if the problem might be due to shared memory and try for example this :

b[i + j*outdimstride[1]] = (unsigned char) 255;

or even simpler :

*b = (unsigned char) 255;

instead of the 3 lines but it doesn’t work too

so it might not be a problem of pointing an unallocated memory area ?

but what else ? a memory allocation problem ? the pointer doesn’t have the expected size ?

if you have a little bit more time to help i will appreciate!

thanks again

ChianLi

Okay, I took a second look at it and it looks like you’re passing a bunch of pointers to host data to the kernel function. Unless there is some way that I don’t know of, anything that you are going to pass (all parameters) to the kernel needs to have space created for it with cudaMalloc and, if you care about the value of the data entering the kernel, the data copied to it using cudaMemcpy.

So, the reason it is crashing is that the access to dim[0] on the second line of the kernel is illegal and is crashing the program. It works in emulation mode because there is no separation between host and device memory spaces in emulation mode.

In fact, the accesses to dim, indimstride, and outdimstride are all illegal, and the accesses to ascale, rscale, gscale, and bscale probably are as well (I’m not sure if you can pass ints directly to a kernel function… I’ve never tried).

Once again…this is just what jumped out at me tonight…ask again tomorrow if it still doesn’t work :)

many many many thanks !
it’s pretty obvious now !

i was searching my mistake for one week !
and i had just identify that the issue might come from the arguments of the kernel call when i saw your answer

thanks a lot, i would kiss you if i could

and yes i can pass ints directly to the kernel and this put me on the wrong way…

thanks again !

best

ChianLi