program crash when copying from device to host <br />

Hello there.
While programming in CUDA (comparing into the kernel two images sent as unsigned char*, computing the square RGB difference of each image’s pixel into an element of device uint64_t cud_err[200][200] matrix) I am stuck in the phase of copying the data back from the device -> host.

Here is the host code:

CallKernelGenetics(const unsigned char *dline, const unsigned char *sline, int h, int w, uint16_t pitch, uint16_t bpp)
uint64_t err_tot = (uint64_t)malloc(sizeof(uint64_t)); // this is the host variable use to initialize the device one (cud_error_tot)
*err_tot = 3;

	CUDA_SAFE_CALL(cudaMalloc((void**)&cud_err_tot, sizeof(uint64_t)));
	CUDA_SAFE_CALL(cudaMemcpy(cud_err_tot, err_tot, sizeof(uint64_t), cudaMemcpyHostToDevice));

	CUDA_SAFE_CALL(cudaMalloc((void **)&cud_dline, w * h * sizeof(unsigned char))); // the first image
	CUDA_SAFE_CALL(cudaMemcpyToSymbol(cud_dline, dline, w * h * sizeof(unsigned char), cudaMemcpyHostToDevice));

            CUDA_SAFE_CALL(cudaMalloc((void **)&cud_sline, w * h * sizeof(unsigned char))); // the second image
	CUDA_SAFE_CALL(cudaMemcpyToSymbol(cud_sline, sline, w * h * sizeof(unsigned char), cudaMemcpyHostToDevice));

	dim3  threads( 16, 16);
	dim3  grid( w/threads.x, h/threads.y);

	testGufKernel<<<grid, threads, 0>>>(bpp, pitch, h, w, cud_err_tot);

	CUDA_SAFE_CALL(cudaMemcpy(err_tot, cud_err_tot, sizeof(uint64_t), cudaMemcpyDeviceToHost)); // here is the problem


	return err_tot[0];


And here is the device code:

device unsigned char *cud_dline;
device unsigned char *cud_sline;
device uint64_t *cud_err_tot;
device uint64_t cud_err[200][200];

global void
testGufKernel(uint64_t cud_bpp, uint64_t cud_pitch, int h, int w, uint64_t* cud_err_tot)
//-- Get the location of the texel corresponding the thread ID
const unsigned int tidx = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int tidy = blockIdx.y * blockDim.y + threadIdx.y;

cud_dline += tidx * cud_pitch +  tidy * cud_bpp;
cud_sline += tidx * cud_pitch +  tidy * cud_bpp;

				int er = (int)(cud_dline[0]) - (int)(cud_sline[0]);
				int eg = (int)(cud_dline[1]) - (int)(cud_sline[1]);
				int eb = (int)(cud_dline[2]) - (int)(cud_sline[2]);
				int er = (int)(cud_dline[1]) - (int)(cud_sline[1]);
				int eg = (int)(cud_dline[2]) - (int)(cud_sline[2]);
				int eb = (int)(cud_dline[3]) - (int)(cud_sline[3]);

if(tidx < w && tidy < h)
	cud_err[tidx][tidy] = ((er * er) + (eb * eb) + (eg * eg));

if(tidx == 0 && tidy == 0)
	*cud_err_tot = cud_err[0][0];


Well the problem shows up when executing the line:

	CUDA_SAFE_CALL(cudaMemcpy(err_tot, cud_err_tot, sizeof(uint64_t), cudaMemcpyDeviceToHost));

The problem is in the kernel attribution -> cud_err[tidx][tidy] = ((er * er) + (eb * eb) + (eg * eg));

If I try instead cud_err[tidx][tidy] = 12 for example it works fine and the host variable err_tot has the right value.

I hope you can help me cause I am driving nuts trying different approaches to solve this.

I have an NVIDIA GeForce 8600 M GT and trying on Windows 7 x64, but I don’t think the windows should be a problem cause the same problem
showed up also on Windows XP x64.


i am not really sure about it, but i think the problem could be those lines:

I think this would result in: every thread changing the value of the pointer. so the address you are using afterwards changes (?).

and you said the problem is the other line. that would make sense because there you actually use the data. if you dont use it

the compiler could be so clever to just skip the lines where you copy the data.

so i would try to use a local variable for that pointer. like:

unsigned char *pDline=cud_dline + tidx * cud_pitch + tidy * cud_bpp;

unsigned char *pSline=cud_sline + tidx * cud_pitch + tidy * cud_bpp;

not sure if this works

Thanks for your reply st3fan82 but unfortunately it did not solve my problem.

Of course, it was wrong what I did in my code, but it seems something else makes the error show up.

I also tried to put those 2 images (unsigned char*) as parameters for the kernel with no results, even if I realize that it doesn’t matter cause
I don’t want to use any of the values of those 2 pointers after the kernel executes.

If you come up with another ideas, please share them with me.

ok, I found another thing that could be a problem.

same lines:

cud_dline += tidx * cud_pitch + tidy * cud_bpp;

cud_sline += tidx * cud_pitch + tidy * cud_bpp;

yes i cant stop thinking about it. ;)

if I understand you right. you are calculating the offset of the pixels you want to work with.

i actually wrote something similar. (also a topic here and i am waiting for help )

and i was calculating the pixeloffset like that:

unsigned char *pDline=cud_dline + tidy * cud_pitch + tidx * cud_bpp;

unsigned char *pSline=cud_sline + tidy * cud_pitch + tidx * cud_bpp;

so the difference is that i multiplicate the pitch with y, because afaik is the pitch the linesize in byte.

after all there is for me the question is the cud_bpp bytes per pixel? or bits?

last thing would be bad.

in my case there was always 3bytes per pixel -> so 1byte per colorvalue what means a char for red green or blue.

so this could also result in an illegal memoryoperation

so i did it in my code like that:

// get the pixel

int elem=y*pitch+3*x;

uchar3 pixel=make_uchar3(((unsigned char)ImgPtr[elem]), ((unsigned char)ImgPtr[elem+1]), ((unsigned char)ImgPtr[elem+2]));

in my code x is your tidx and y is tidy, not sure if this is a fast way to do it

if you dio it like that you can acces every byte/char with pixel.x pixel.y …

and there should also be an uchar4

by the way now I have a question, about those global variable like your image pointers.

are they also global on a device or did every thread get a copy. (never thought about it thanks to you ;) )

this is the reason why i said in first post that i am not sure about.

Thanks again, now I am re-writing in you style. Hope it works.

About your question, but as much as I read from CUDA manual programming 2.0 I learned
that if the variables are declared outside the kernel are global to every thread.
Every thread would have taken a copy of those 2 images only if it was passed as an argument as I know.

Sincerely declaring them as global variables wasn’t my first solution, but I needed to allocate the two cuda variables
in order to work with them.

This did not work as well st3fan82

My code became:

int elem = tidy * cud_pitch + cud_bpp * tidx;
uchar3 dpixel;
uchar3 spixel;

dpixel = make_uchar3(((unsigned char)cud_dline[elem]), ((unsigned char)cud_dline[elem+1]), ((unsigned char)cud_dline[elem+2]));
spixel = make_uchar3(((unsigned char)cud_sline[elem]), ((unsigned char)cud_sline[elem+1]), ((unsigned char)cud_sline[elem+2]));
int er = (int)(dpixel.x) - (int)(spixel.x);
int eg = (int)(dpixel.y) - (int)(spixel.y);
int eb = (int)(dpixel.z) - (int)(spixel.z);
dpixel = make_uchar3(((unsigned char)cud_dline[elem+1]), ((unsigned char)cud_dline[elem+2]), ((unsigned char)cud_dline[elem+3]));
spixel = make_uchar3(((unsigned char)cud_sline[elem+1]), ((unsigned char)cud_sline[elem+2]), ((unsigned char)cud_sline[elem+3]));
int er = (int)(dpixel.x) - (int)(spixel.x);
int eg = (int)(dpixel.y) - (int)(spixel.y);
int eb = (int)(dpixel.z) - (int)(spixel.z);

The error is still bind with the line of code

if(tidx < h && tidy < w)
cud_err[tidx][tidy] = ((er * er) + (eb * eb) + (eg * eg));

and again, replacing cud_err[tidx][tidy] with any random value it works fine and host gets the correct value.

I also thought about the dimension of the matrix I use here.
But it’s declared as device uint64_t cud_err[200][200], uint64_t being in SDL libraries a very high number(or I guess so, being represented on 8 bytes).
So I suppose it’s not an unnoticed overflow by adding three square integers.
Don’t have any idea

i also think an overflow cant be the problem, because the worst case in your kernel would be three times square(255) ( ca 200.000)
because the er eg eb are calculated from chars (0-255).
and so an 32bit integer should also do the trick
only idea i have now that your cud_err matrix is always 200 * 200 is this really the maximum size of the images?

200 * 200 is a generic measure I gave only to try to fix the trouble mentioned earlier :)
Instead it can be bigger, but for now, until I fix the problem, I have an input image of 150 * 150 pixels.
So it shouldn’t be a problem…

ah i found something else: your mallocs
they are always whsizeof(char)
but per pixel you have 4 bytes right?
so whsizeof(char)4 should be right
much better would be just h * pitch, because pitch is given in byte
and as i read the bytesperpixel
w could be less than pitch

ok now it has to work ;)
good luck

Thanks… I’ll try after I finish work and let you know the results.
Don’t know how come I didn’t notice this fatal mistake.
Hope this makes it work at last

No result man. :(
I kept your last opinion in code, allocate h * sizeof(unsigned char) * pitch for those two images.

Observed that for

unsigned char *pDline=cud_dline + tidx * cud_pitch + tidy * cud_bpp;
unsigned char *pDline=cud_dline + tidx * cud_pitch + tidy * cud_bpp;

Cuda does not complain for (er * er), where int er = (int)(pDline ) - (int)(pSline);
But instead if I try to calculate pSline[0], pSline[1] or pSline[2] and the same for pDline, it does not work.
Maybe it helps you throughout getting to a hint.

Finally solved it.

Instead of what I showed you earlier I did 2 things:

  1. I also included the 2 arrays cud_dline, cud_sline into the kernel parameters

testGufKernel<<<grid, threads, 0>>>(cud_dline, cud_sline, bpp, pitch, h, w, cud_err_tot);

  1. I used cudaMemcpy instead of cudaMemcpyToSymbol

Thanks again for your help st3fan82