How to apply multiple kernels to buffered work-items before proceeding to the rest of the NDRange

I am porting a scientific computing application to use OpenCL for the main computational workload. Specifically, this is a Computational Fluid Dynamics solver, so the nature of the algorithm involves communication only with nearest neighbor ‘cells’ in the fluid.

In targeting modern GPU hardware, my algorithm is well suited in the sense that most communication is limited to neighboring work-items executed in the same work-group. It is limited, though, because the amount of floating-point calculation per cell in a given time step is small relative to the cost of moving that cell's data over the PCIe bus (both bandwidth and latency).

The issue:
Say I am running on a machine with 16 GB of RAM, equipped with a GPU that has 1 GB of on-card RAM.
The overall computation in each time step is broken up into multiple sub-steps. I want each sub-step to be written as an OpenCL function, probably each as a kernel function, because the order and the specific functions are determined at run time.
This means that if my problem size is 8 GB, it is too big to fit in GPU memory all at once. I want all sub-steps to be executed on the cells that have already been transferred into GPU RAM before swapping the buffer out over the PCIe bus.

In my reading of the OpenCL spec, including the memory map functions that have been discussed in this forum before, I don’t see a way to apply multiple kernels to only part of a memory buffer at a time.

I would like to call clEnqueueNDRangeKernel using the entire 8 GB problem size. Even with a properly linked queue of kernels and the dependencies between them established, it seems that each kernel would have to run over the entire 8 GB problem, with many PCIe bus transfers, before the next kernel starts and repeats those same transfers unnecessarily.

I have considered two possible solutions, but I am mainly wondering if other OpenCL users have encountered this issue or can suggest something more elegant? I’m not as familiar with CUDA, but perhaps it addresses this issue?

One would be to add more complexity to my host code, such that the problem domain is broken into smaller parts that each fit entirely in GPU memory. Then I can execute a string of kernels on each smaller part, using the device queue, and presumably avoid unnecessary bus communication. This added complexity would include managing neighbor cells at the splits, duplicating them using a technique commonly called ‘ghost cells’.
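
To make the first option concrete, here is a minimal host-side sketch of the bookkeeping, assuming a 1D array of cells. The names `chunk_t`, `chunk_count`, and `plan_chunk` are hypothetical and only illustrate the idea. It works out which cells each part owns and which extra ghost cells must be copied along with them. If several sub-steps that each read only immediate neighbors are run back to back on one part, the ghost layer presumably has to be at least that many cells wide, since each sub-step invalidates one more layer at the split.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one sub-domain of a 1D array of cells. */
typedef struct {
    size_t first_owned;   /* first cell this chunk is responsible for updating */
    size_t owned_count;   /* number of cells this chunk updates */
    size_t first_loaded;  /* first cell copied to the device (left ghosts included) */
    size_t loaded_count;  /* owned cells plus ghost cells on both sides */
} chunk_t;

/* Number of chunks needed when each chunk owns at most max_owned cells. */
size_t chunk_count(size_t total_cells, size_t max_owned)
{
    assert(max_owned > 0);
    return (total_cells + max_owned - 1) / max_owned;
}

/* Plan chunk number `index`, padding it with up to `halo` ghost cells per side.
 * Ghost layers are clipped at the ends of the global domain. */
chunk_t plan_chunk(size_t total_cells, size_t max_owned, size_t halo, size_t index)
{
    assert(max_owned > 0);
    assert(index < chunk_count(total_cells, max_owned));

    chunk_t c;
    c.first_owned = index * max_owned;

    size_t remaining = total_cells - c.first_owned;
    c.owned_count = remaining < max_owned ? remaining : max_owned;

    size_t owned_end = c.first_owned + c.owned_count;
    size_t left  = c.first_owned < halo ? c.first_owned : halo;
    size_t right = total_cells - owned_end < halo ? total_cells - owned_end : halo;

    c.first_loaded = c.first_owned - left;
    c.loaded_count = left + c.owned_count + right;

    /* For each chunk the host would then, on an in-order queue:
     *   1. clEnqueueWriteBuffer  the loaded range to the device,
     *   2. clEnqueueNDRangeKernel once per sub-step,
     *   3. clEnqueueReadBuffer   only the owned range back to the host. */
    return c;
}
```

With 10 cells, at most 4 owned cells per part, and a ghost layer of 2, this gives three parts; the middle one owns cells 4 to 7 but loads cells 2 to 9.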

There may be another option using only one kernel function that somehow dynamically builds a list of other OpenCL functions to be called for each work-item. I understand that function pointers are not supported in the language, but I think something could still be worked out to implement this. Then, when a work-item is executed, multiple algorithm steps could be applied using that cell’s value and neighboring cells while still only calling clEnqueueNDRangeKernel once on the entire problem domain. I’ll admit, I just thought about this while writing, so I haven’t thought it all the way through.
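
A rough sketch of how the second option might look without function pointers: the host passes a list of integer step codes, and a `switch` picks the function for each one. This is plain C that emulates the per-work-item body on the CPU, not actual kernel code, and the step names (`STEP_SMOOTH`, `STEP_SCALE`) and functions are made up for illustration. One caveat it makes visible: each step reads neighbor values produced by the previous step, so all cells must finish a step before any cell starts the next. Inside a real kernel, `barrier()` only synchronizes within a work-group, so this would only work directly for neighbors in the same work-group.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sub-step codes that the host would choose at run time. */
enum { STEP_SMOOTH = 0, STEP_SCALE = 1 };

/* Example neighbor-reading step: average of a cell and its two neighbors,
 * with the edge cell reused at the domain boundaries. */
static float step_smooth(const float *in, size_t i, size_t n)
{
    float left  = in[i == 0 ? 0 : i - 1];
    float right = in[i + 1 == n ? i : i + 1];
    return (left + in[i] + right) / 3.0f;
}

/* Example purely local step. */
static float step_scale(const float *in, size_t i)
{
    return 2.0f * in[i];
}

/* Stand-in for function pointers: dispatch on an integer step code. */
float apply_step(int op, const float *in, size_t i, size_t n)
{
    switch (op) {
    case STEP_SMOOTH: return step_smooth(in, i, n);
    case STEP_SCALE:  return step_scale(in, i);
    default:
        assert(0 && "unknown step code");
        return in[i];
    }
}

/* Apply every step in `ops` to all n cells, ping-ponging between two buffers.
 * The end of each inner loop plays the role of the synchronization point
 * that a kernel would need between steps. Returns the buffer holding the result. */
float *run_steps(const int *ops, size_t n_ops, float *cur, float *next, size_t n)
{
    for (size_t s = 0; s < n_ops; ++s) {
        for (size_t i = 0; i < n; ++i)      /* i stands in for get_global_id(0) */
            next[i] = apply_step(ops[s], cur, i, n);
        float *tmp = cur;                   /* swap input and output buffers */
        cur = next;
        next = tmp;
    }
    return cur;
}
```

The two buffers are there so that no cell reads a neighbor that has already been overwritten by the current step.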

Can anyone offer some advice or report having tried something else?