Asynchronous IO for large images

I am interested in processing very large images that fit into
neither main CPU memory nor GPU memory.

So, I want to do the following:

  1. Read a tile asynchronously from disk
  2. Copy it from CPU memory to GPU memory
  3. Process the tile with a CUDA kernel

Can anybody provide an example of how to do this
in the most efficient way, so that disk I/O,
copying from CPU to GPU memory, and processing
can be overlapped/interleaved?

Any ideas/examples are appreciated.

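Look at the simpleStreams sample in the CUDA SDK. It shows how to overlap GPU processing with a copy from GPU to CPU. You can use the same idea, but add CPU-to-GPU memcopies. Note that the host buffers have to be page-locked (allocated with cudaMallocHost) for cudaMemcpyAsync to actually overlap with kernel execution.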
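
To make the idea concrete, here is a rough double-buffered sketch; it is not taken from simpleStreams. It assumes the image is a raw file of 8-bit pixels that can be cut into fixed-size tiles. The names processTile, processFile, and tileBytes are made up for illustration, and the kernel (pixel inversion) is only a placeholder for real processing. The disk read is a plain blocking fread on the host thread; it still overlaps with GPU work because the copies and the kernel for the other buffer are queued asynchronously in their own stream. Truly asynchronous disk I/O (a reader thread or OS-level async reads) would be a further refinement and is not shown.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort on any CUDA error (a sketch; real code would clean up instead).
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",                 \
                    cudaGetErrorString(e_), __FILE__, __LINE__);        \
            std::exit(1);                                               \
        }                                                               \
    } while (0)

// Placeholder kernel (assumption): invert each 8-bit pixel in place.
__global__ void processTile(unsigned char *px, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) px[i] = 255 - px[i];
}

// Streams a raw file through the GPU one tile at a time, using two
// pinned host buffers, two device buffers, and two streams.
bool processFile(const char *inPath, const char *outPath, size_t tileBytes) {
    FILE *in = fopen(inPath, "rb");
    if (!in) return false;
    FILE *out = fopen(outPath, "wb");
    if (!out) { fclose(in); return false; }

    const int NBUF = 2;
    unsigned char *hBuf[NBUF], *dBuf[NBUF];
    cudaStream_t stream[NBUF];
    size_t pending[NBUF] = {0, 0};  // bytes in flight per buffer

    for (int b = 0; b < NBUF; ++b) {
        CHECK(cudaMallocHost((void **)&hBuf[b], tileBytes));  // pinned
        CHECK(cudaMalloc((void **)&dBuf[b], tileBytes));
        CHECK(cudaStreamCreate(&stream[b]));
    }

    int b = 0;
    for (;;) {
        // Wait until buffer b is idle, then write out its finished tile.
        CHECK(cudaStreamSynchronize(stream[b]));
        if (pending[b]) fwrite(hBuf[b], 1, pending[b], out);

        // Blocking read; the other stream's GPU work runs meanwhile.
        size_t n = fread(hBuf[b], 1, tileBytes, in);
        pending[b] = n;
        if (n == 0) break;

        // Queue upload, kernel, and download in this buffer's stream.
        CHECK(cudaMemcpyAsync(dBuf[b], hBuf[b], n,
                              cudaMemcpyHostToDevice, stream[b]));
        unsigned int blocks = (unsigned int)((n + 255) / 256);
        processTile<<<blocks, 256, 0, stream[b]>>>(dBuf[b], n);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hBuf[b], dBuf[b], n,
                              cudaMemcpyDeviceToHost, stream[b]));

        b = (b + 1) % NBUF;
    }

    // Drain the tiles still in flight, oldest first.
    for (int k = 1; k < NBUF; ++k) {
        int j = (b + k) % NBUF;
        CHECK(cudaStreamSynchronize(stream[j]));
        if (pending[j]) fwrite(hBuf[j], 1, pending[j], out);
    }

    for (int i = 0; i < NBUF; ++i) {
        CHECK(cudaStreamDestroy(stream[i]));
        CHECK(cudaFree(dBuf[i]));
        CHECK(cudaFreeHost(hBuf[i]));
    }
    fclose(in);
    fclose(out);
    return true;
}
```

A call might look like processFile("image.raw", "result.raw", 64 << 20) for 64 MB tiles (both file names are hypothetical). How much overlap you actually get depends on the device: copies in both directions can only run concurrently with each other on GPUs with two copy engines, and tiles that need a border from their neighbours would need overlapping reads that this sketch does not handle.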