Relative addressing in CUDA device memory

I want to use a single large array (d_Output) in global device memory that different kernels can access to store flattened images of varying size. I know the sizes of the images before allocating the array at startup, and want to read and write each image via its relative position (the accumulated size of all previous images) from the base pointer of the array. On a CPU or embedded device this is straightforward.

When I try this with CUDA I get errors. I know that the contents of device memory must be transferred to the host for access, but there should be some way to define a location in the array that you can write to and later retrieve when copying back to the host. Is this impossible with CUDA, or am I doing something wrong?

The alternatives, as I see it, would be to have separate variables for each image (tedious) or a 2D array pitched to the largest flattened image size (wasteful). I have tried unified memory, but my screen blacks out, so I have reverted to a device/host setup.

An extract of what I want to do is shown below:

int images=10; //10 1D images of varying size
int *h_imSize = (int *)malloc(images * sizeof(int)); //array for storing image sizes
// Code for adding the sizes of the images to h_imSize
int *h_accuImSize = (int *)malloc(images * sizeof(int)); //array for storing the accumulated image sizes
thrust::inclusive_scan(h_imSize, h_imSize + images, h_accuImSize); //inclusive scan to get accumulated size

float *d_Output;
cudaMalloc((void **)&d_Output, h_accuImSize[images - 1] * sizeof(float)); //last element contains total size

myKernel <<<  numBlocks, threadsPerBlock >>> (d_Output+h_accuImSize[3]); //writes to device memory with relative address

float *h_Output2 = (float *)malloc(h_imSize[3] * sizeof(float));
cudaMemcpy(h_Output2, d_Output + h_accuImSize[3], h_imSize * sizeof(float), cudaMemcpyDeviceToHost); //transfers from relative address of device memory to host memory for further processing

Your method should work fine. Saying “When trying this with CUDA I get errors” doesn’t give others much useful information to try and help you.

There are probably some errors with the code you have shown, but since it is incomplete, I’m not really sure if this is actually the code you are using. For example:

h_imSize * sizeof(float),

is not the right way to get the image size. h_imSize is an array (a pointer, technically), and you have not identified an array index. Probably you want something like:

h_imSize[3] * sizeof(float),

I’m not going to try to discover all of those kinds of errors unless you want to provide a complete code that someone else could test.

I wrote the post without being able to check the code or the specific error message I got.

You are correct about the missing array index, but the question was more general: whether I can use relative addressing on device memory at all, and I think you have confirmed that.
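For reference, an end-to-end sketch of the pattern discussed above (hypothetical sizes, a trivial fill kernel, exclusive offsets computed with a plain loop, and error checking omitted for brevity):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void fillKernel(float *out, int n, float value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;  // writes through the offset base pointer
}

int main() {
    const int images = 3;
    int h_imSize[images] = {256, 512, 128};  // made-up image sizes
    int h_offset[images];                    // exclusive offsets
    int total = 0;
    for (int i = 0; i < images; ++i) { h_offset[i] = total; total += h_imSize[i]; }

    float *d_Output;
    cudaMalloc((void **)&d_Output, total * sizeof(float));

    // One launch per image, each writing at its relative position.
    for (int i = 0; i < images; ++i)
        fillKernel<<<(h_imSize[i] + 255) / 256, 256>>>(d_Output + h_offset[i],
                                                       h_imSize[i], (float)i);

    // Copy back just image 1, starting at its relative address.
    float *h_img1 = (float *)malloc(h_imSize[1] * sizeof(float));
    cudaMemcpy(h_img1, d_Output + h_offset[1], h_imSize[1] * sizeof(float),
               cudaMemcpyDeviceToHost);
    printf("first element of image 1: %f\n", h_img1[0]);

    free(h_img1);
    cudaFree(d_Output);
    return 0;
}
```

Pointer arithmetic on a cudaMalloc'd pointer is valid in both kernel arguments and cudaMemcpy, so each launch and copy only ever sees its own slice of the one big allocation.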