I want to use a single large array (d_Output) in global device memory that different kernels can access to store flattened images of varying size. I know the sizes of the images before allocating the array at start, and want to read and write each image via its relative position (the accumulated size of all previous images) added to the array's base pointer. On a CPU or embedded device this is straightforward.
When trying this with CUDA I get errors. I know that device memory must be copied back to the host before the host can read it, but there should be some way to address a location inside the array that a kernel can write to and that I can later read from when copying back to the host. Is this trick impossible with CUDA, or am I doing something wrong?
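For reference, pointer arithmetic on a device pointer is legal in host code as long as the resulting pointer is only dereferenced on the device. A minimal sketch of the pattern I have in mind (the kernel and the two sizes are made-up placeholders, not my real code):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Hypothetical kernel: fills n floats starting at out with value v.
__global__ void fillKernel(float *out, int n, float v) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = v;
}

int main() {
    const int sizeA = 100, sizeB = 250;  // two flattened "images"
    float *d_Output;
    cudaMalloc((void **)&d_Output, (sizeA + sizeB) * sizeof(float));

    // Each launch gets the base pointer plus the accumulated size
    // of all previous images -- ordinary pointer arithmetic.
    fillKernel<<<1, 256>>>(d_Output,         sizeA, 1.0f);
    fillKernel<<<1, 256>>>(d_Output + sizeA, sizeB, 2.0f);

    // The same offset selects the second image when copying back.
    float *h_B = (float *)malloc(sizeB * sizeof(float));
    cudaMemcpy(h_B, d_Output + sizeA, sizeB * sizeof(float),
               cudaMemcpyDeviceToHost);

    free(h_B);
    cudaFree(d_Output);
    return 0;
}
```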
The alternatives as I see them would be to keep a separate variable for each image (tedious) or to use a pitched 2D allocation whose pitch equals the largest flattened image size (wasteful). I have tried unified memory, but my screen blacks out, so I have reverted to a separate device/host setup.
An extract of what I want to do is shown below:
```cuda
int images = 10;  // 10 1D images of varying size
int *h_imSize = (int *)malloc(images * sizeof(int));      // image sizes
// ... code that fills h_imSize with the size of each image ...

int *h_accuImSize = (int *)malloc(images * sizeof(int));  // accumulated sizes
thrust::inclusive_scan(h_imSize, h_imSize + images, h_accuImSize);

float *d_Output;
cudaMalloc((void **)&d_Output,
           h_accuImSize[images - 1] * sizeof(float));  // last element is the total size

for (int i = 0; i < images; ++i) {
    // Start offset of image i = accumulated size of all previous images.
    int offset = (i == 0) ? 0 : h_accuImSize[i - 1];
    myKernel<<<numBlocks, threadsPerBlock>>>(d_Output + offset);  // write at the relative address
}

// Copy one image (index i) back to the host for further processing.
float *h_Output = (float *)malloc(h_imSize[i] * sizeof(float));
cudaMemcpy(h_Output, d_Output + offset, h_imSize[i] * sizeof(float),
           cudaMemcpyDeviceToHost);
```