Point Processing of Huge Images using GPU

I’m a few days into learning CUDA programming (using Visual Studio 2008, 64-bit) and my main interest is in processing huge satellite images (>1 GB). There are numerous examples showing how to filter smallish images using tiles, but I can’t find any examples of how to point process HUGE images. For example, if I have an 8-bit image that is 50,000 x 50,000 pixels (2.5 GB) and I just want to add a constant value to each pixel, how would I organize memory management on the GPU? For simplicity, assume I have the entire image in memory on the CPU side. Would I send data to the GPU as tiles, as a single line, or as a group of lines? Ultimately I have a complex (very nasty) function that I need to apply to each pixel. Any hints appreciated!



I don’t have much experience in image processing, but I have done some simple applications. In general, if your filtering function is local, i.e. the new value of a given pixel does not depend on the neighboring pixels, you can simply treat the image as a 1D array (although you might use 2D textures for quicker fetching on the GPU). If the whole image doesn’t fit into the GPU’s memory, you’d simply divide it up into chunks that do fit.
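To make the chunking idea concrete, here is a minimal sketch of a point operation applied to a host image one chunk at a time. The names `addConstant` and `addConstantChunked`, the saturating add, and the fixed launch size of 1024 x 256 threads are all assumptions for illustration, not something from this thread; the "very nasty" per-pixel function would replace the body of the kernel loop.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

// Hypothetical point operation: add a constant to each 8-bit pixel,
// saturating the result to the range 0..255.
__global__ void addConstant(unsigned char* px, size_t n, int c)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) {               // grid-stride loop over the chunk
        int v = px[i] + c;
        px[i] = (unsigned char)min(max(v, 0), 255);
    }
}

// Process a host image of n bytes in pieces of at most chunkBytes,
// reusing a single device buffer for every piece.
cudaError_t addConstantChunked(unsigned char* host, size_t n, int c,
                               size_t chunkBytes)
{
    if (chunkBytes == 0) return cudaErrorInvalidValue;

    unsigned char* dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, chunkBytes);
    if (err != cudaSuccess) return err;

    for (size_t off = 0; off < n; off += chunkBytes) {
        size_t len = std::min(chunkBytes, n - off);   // last chunk may be short

        err = cudaMemcpy(dev, host + off, len, cudaMemcpyHostToDevice);
        if (err != cudaSuccess) break;

        addConstant<<<1024, 256>>>(dev, len, c);      // assumed launch size
        err = cudaGetLastError();
        if (err != cudaSuccess) break;

        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(host + off, dev, len, cudaMemcpyDeviceToHost);
        if (err != cudaSuccess) break;
    }

    cudaFree(dev);
    return err;
}
```

For a pure point operation the shape of the chunk does not matter, so whole groups of image lines (or any contiguous byte range) work equally well. A call such as `addConstantChunked(image, imageBytes, 10, 256u * 1024u * 1024u)` would walk the image in 256 MB pieces; that chunk size is only an example.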

On the other hand, if your filter is non-local (say it involves derivatives, smearing, etc.), then you’ll need to minimize the number of reads and writes to global GPU memory. You’d divide the image into overlapping tiles that fit into GPU memory. Then, within each of those, you’d load smaller overlapping tiles into shared memory so that the same pixel isn’t read from global memory multiple times. I did this for a filter with a second-order derivative and it worked just fine for large enough photos, especially with an iterative method that needs multiple passes before each CPU<->GPU transfer. Synchronizing the ghost areas is still a performance killer, though; ideally you would do that asynchronously while the interior parts are being calculated. A Tesla card with 4 GB of memory might be very useful for you.
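As an illustration of the shared-memory tiling described above, here is a minimal sketch of a 5-point Laplacian (a second-order derivative filter) in which each thread block stages its tile plus a one-pixel ghost border in shared memory. The kernel name `laplacianTiled`, the wrapper `laplacianOnDevice`, the 16 x 16 tile size, and the clamp-to-edge border handling are assumptions for the example, not the filter used in the post.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

#define TILE 16   // interior tile edge; an assumed value, tune for your card

// 5-point Laplacian on a w x h float image already resident on the device.
// Neighbours outside the image are clamped to the nearest edge pixel.
__global__ void laplacianTiled(const float* in, float* out, int w, int h)
{
    __shared__ float s[TILE + 2][TILE + 2];    // tile plus a 1-pixel halo

    int baseX = blockIdx.x * TILE;
    int baseY = blockIdx.y * TILE;

    // Cooperative load: the TILE x TILE threads cover the (TILE+2)^2 region,
    // so threads near the low edge of the block load more than one cell.
    for (int ly = threadIdx.y; ly < TILE + 2; ly += TILE) {
        for (int lx = threadIdx.x; lx < TILE + 2; lx += TILE) {
            int x = min(max(baseX + lx - 1, 0), w - 1);
            int y = min(max(baseY + ly - 1, 0), h - 1);
            s[ly][lx] = in[(size_t)y * w + x];
        }
    }
    __syncthreads();                           // halo is complete past this point

    int gx = baseX + threadIdx.x;
    int gy = baseY + threadIdx.y;
    if (gx < w && gy < h) {
        int lx = threadIdx.x + 1, ly = threadIdx.y + 1;
        out[(size_t)gy * w + gx] =
            s[ly][lx - 1] + s[ly][lx + 1] +
            s[ly - 1][lx] + s[ly + 1][lx] - 4.0f * s[ly][lx];
    }
}

// Launch helper: dIn and dOut are device pointers to w*h floats each.
cudaError_t laplacianOnDevice(const float* dIn, float* dOut, int w, int h)
{
    dim3 block(TILE, TILE);
    dim3 grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);
    laplacianTiled<<<grid, block>>>(dIn, dOut, w, h);
    return cudaGetLastError();
}
```

Each input pixel is read from global memory roughly once per block instead of up to five times. The sketch does not attempt the asynchronous ghost-area exchange mentioned above; that would need streams and `cudaMemcpyAsync` on top of this.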


What you want to do will be a challenge, in my experience. I do large image convolutions on a Tesla C1060 using VS 2008, 64-bit, but by large I mean something like twenty 1-megapixel 32-bit images. One specific problem that I have run into is that under Windows there is an upper limit on the size of a single allocation on the CUDA device. In my case it turned out to be something like 1.7 GB. Even though the C1060 has 4 GB of memory, I cannot use it all for a single object.
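Since the largest single allocation can be smaller than the free memory the card reports, one defensive approach is to ask for what you want and back off until an allocation succeeds. The sketch below assumes nothing about the actual limit; `allocLargest` and its halving strategy are hypothetical choices for illustration, while `cudaMemGetInfo`, `cudaMalloc`, and `cudaGetLastError` are standard CUDA runtime calls.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Try to allocate `wanted` bytes on the device. If that fails, halve the
// request until it succeeds or falls below `minBytes`.
// On success returns the device pointer and stores the size in *got;
// on failure returns nullptr and stores 0.
void* allocLargest(size_t wanted, size_t minBytes, size_t* got)
{
    size_t freeB = 0, totalB = 0;
    // Never ask for more than the driver says is free right now.
    if (cudaMemGetInfo(&freeB, &totalB) == cudaSuccess && wanted > freeB)
        wanted = freeB;

    for (size_t sz = wanted; sz >= minBytes && sz > 0; sz /= 2) {
        void* p = nullptr;
        if (cudaMalloc(&p, sz) == cudaSuccess) {
            *got = sz;
            return p;
        }
        cudaGetLastError();   // reset the recorded error before retrying
    }
    *got = 0;
    return nullptr;
}
```

The size that comes back in `*got` can then be used as the chunk size when streaming a larger image through the device.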

To convolve my 1-megapixel images, I do the following:

  1. Create a DirectX texture on the CPU large enough for the image to be convolved.

  2. I create two 2D arrays on the CUDA device with the same pixel dimensions and pixel format: one for the image input, one for the buffered image output.

  3. I load the CPU texture with the image data.

  4. I copy the CPU texture bitmap array over to the CUDA input array.

  5. I convolve the CUDA input array into the CUDA output array on the GPU. Two CUDA arrays are used so that it is easier to do non-local convolutions without the output of one pixel affecting the input of the next.

  6. I copy the CUDA output array back to the CPU texture bitmap array.

  7. I display the CPU texture on a DirectX plane in 3D space. Since I’m doing neural nets, I have several planes of images where the image data represents the neural net node values.

I’m not sure if this will work for your large images. If you have to break the image into sub-images, it will be a challenge to do a non-local convolution at the boundaries of the sub-images.
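One common way to handle the sub-image boundaries is to upload each sub-image with a few extra rows on either side and keep only the interior results. The host-side sketch below plans such strips for a filter that reaches `radius` rows above and below each pixel; the `Strip` struct, the `planStrips` function, and the choice of horizontal strips are assumptions for illustration.

```cuda
#include <algorithm>
#include <vector>

// Row ranges for one strip of the image; all ranges are half-open [y0, y1).
struct Strip {
    int loadY0, loadY1;   // rows to upload, including the halo rows
    int outY0, outY1;     // rows whose filtered results are valid to keep
};

// Split an image of `height` rows into strips that each produce at most
// `rowsPerStrip` output rows, padding every strip with `radius` halo rows
// (clamped at the top and bottom of the image).
std::vector<Strip> planStrips(int height, int rowsPerStrip, int radius)
{
    std::vector<Strip> strips;
    if (rowsPerStrip <= 0 || radius < 0) return strips;

    for (int y = 0; y < height; y += rowsPerStrip) {
        Strip s;
        s.outY0  = y;
        s.outY1  = std::min(y + rowsPerStrip, height);
        s.loadY0 = std::max(s.outY0 - radius, 0);
        s.loadY1 = std::min(s.outY1 + radius, height);
        strips.push_back(s);
    }
    return strips;
}
```

For example, `planStrips(100, 40, 2)` yields three strips that upload rows 0..42, 38..82, and 78..100 while writing back rows 0..40, 40..80, and 80..100. The halo rows are transferred twice, which is the price of keeping each strip independent; with a radius of zero this reduces to plain chunking for point operations.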

Ken Chaffin