CUDA error with cudaMemcpy

Hi all, I am new to CUDA (and to C++; I have always programmed in Matlab).
I am working on a comparatively complicated problem, so I will not post all the code here. I have run into an issue I cannot resolve: after my global kernel finishes, I copy an array back to host memory, and at that stage I get the error cudaErrorIllegalAddress (77). Here is some of the code, without the global kernel:

Pixel  *d_img1,*d_img2;
float *d_I;
float *h_I = (float *)malloc(mem_size_out);
error = cudaMalloc((void **)&d_I, mem_size_out);
CalcilateWithCuda <<< grid, threads >>>(d_img1, d_img2, d_I, step, winsize, 1200);
error = cudaMemcpy(h_I, d_I, mem_size_out, cudaMemcpyDeviceToHost);

So that is how I set up h_I: just with a malloc call. After the kernel I try to copy d_I into h_I, and it fails for some reason.

Any guesses what it might be? If no error is reported inside the kernel, does that mean the kernel itself is fine (i.e., that to fix this error I have to look outside the CUDA kernel)?

It has nothing to do with the cudaMemcpy operation itself. It means that your kernel is making an out-of-bounds access. You can run your code under cuda-memcheck to confirm this; you should see output like “invalid global read of size 4” or similar.

Since you haven’t shown your kernel code, obviously nobody can help you with that. But if you follow the technique described here:

You can probably narrow it down to a single offending line in your kernel.
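As a general habit, it also helps to check the status of every CUDA runtime call, so a failure surfaces where it happens rather than at the next cudaMemcpy. A minimal sketch (the macro name is my own; the kernel call is the one from the question):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line information if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err = (call);                                           \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error %d (%s) at %s:%d\n",                \
                    (int)err, cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// After a kernel launch, check both the launch itself and the
// asynchronous execution:
//   CalcilateWithCuda<<<grid, threads>>>(d_img1, d_img2, d_I, step, winsize, 1200);
//   CUDA_CHECK(cudaGetLastError());       // launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
```

With these checks in place, an illegal address in the kernel is reported by the cudaDeviceSynchronize() right after the launch instead of by a later, unrelated call.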

Thanks for the reply! I actually found out that when I decrease the amount of work (the number of blocks), it works; at some point I instead get cudaErrorLaunchTimeout. So my problem is: how can I get the CUDA kernel to run to completion without hitting the timeout?

That seems like a different problem.

this may be of interest:
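For context: the launch timeout usually comes from the OS display watchdog, which kills any kernel that runs for more than a few seconds on a GPU that is also driving a display. Besides running on a compute-only GPU or disabling the watchdog, a common workaround is to split the work into several shorter launches. A hedged sketch, assuming the rows of the output can be processed independently and that a variant of the kernel taking a row offset exists (both are my assumptions, not code from the thread):

```cuda
// Process the image in horizontal strips so that each launch stays
// well under the watchdog limit. rowsPerLaunch is a tuning parameter.
int rowsPerLaunch = 64;
for (int row0 = 0; row0 < height; row0 += rowsPerLaunch) {
    int rows = min(rowsPerLaunch, height - row0);
    dim3 strip((width + threads.x - 1) / threads.x,
               (rows  + threads.y - 1) / threads.y);
    // Hypothetical kernel variant that takes a row offset and count:
    CalcilateWithCudaStrip<<<strip, threads>>>(d_img1, d_img2, d_I,
                                               step, winsize, 1200,
                                               row0, rows);
    cudaDeviceSynchronize();  // each strip finishes quickly on its own
}
```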

I think I figured out what it was: I declared arrays inside the global kernel, so it was simply a stack overflow. Thanks!
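For reference, an array declared inside a kernel is per-thread local memory, and large ones can exceed the per-thread stack/local-memory limit. The usual fix is to allocate one scratch region per thread in global memory from the host and have each thread index its own slice. A sketch under the assumption that each thread needs winsize * winsize floats of scratch (names are illustrative):

```cuda
// Host side: one global-memory scratch slice per thread, instead of
// a large per-thread local array inside the kernel.
float *d_scratch;
size_t perThread = (size_t)winsize * winsize * sizeof(float);  // assumed need
size_t nThreads  = (size_t)grid.x * grid.y * threads.x * threads.y;
cudaMalloc((void **)&d_scratch, nThreads * perThread);

// Device side: each thread computes its flattened global index and
// uses its own slice of the scratch buffer.
// __global__ void Kernel(..., float *scratch, int winsize) {
//     size_t tid  = ...;  // flattened global thread index
//     float *mine = scratch + tid * (size_t)winsize * winsize;
//     ...
// }
```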

Since you mentioned that you are new to C and come from a Matlab background, keep in mind that the sizes passed to malloc() and memcpy(), as well as to their CUDA counterparts, are in bytes. So, to allocate an array of 10 floats, use 10 * sizeof(float), not just 10. Obviously, I can’t see how you compute the mem_size_out variable.
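To make that concrete, a typical way to compute such a size for a width × height image of floats looks like this (the dimensions are illustrative; whether this matches your mem_size_out is exactly what you'd want to verify):

```cuda
int width = 640, height = 480;                                 // example dimensions
size_t mem_size_out = (size_t)width * height * sizeof(float);  // bytes, not elements

float *h_I = (float *)malloc(mem_size_out);
float *d_I;
cudaMalloc((void **)&d_I, mem_size_out);
// ... launch kernel writing width*height floats into d_I ...
cudaMemcpy(h_I, d_I, mem_size_out, cudaMemcpyDeviceToHost);
```

If mem_size_out were computed as width * height (an element count rather than a byte count), the kernel could easily write past the end of the allocation, which would also produce an illegal-address error.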