How to apply multiple kernels to buffered work-items before proceeding to the rest of the NDRange

I am porting a scientific computing application to use OpenCL for the main computational workload. Specifically, it is a Computational Fluid Dynamics solver, so the algorithm only communicates with nearest-neighbor ‘cells’ in the fluid.

In targeting modern GPU hardware, my algorithm is well suited in the sense that most communication is limited to neighboring work-items executed in the same work-group. It is limited, though, because the amount of floating-point computation per cell at a given time step is small compared to the cost of moving that cell over the PCIe bus, in both bandwidth and latency.

The issue:
Say I am running on a machine with 16 GB of RAM and a GPU with 1 GB of on-board RAM.
The overall computation at each time step is broken up into multiple sub-steps. I want each sub-step to be written as an OpenCL function, probably each as a kernel, because the order and the specific functions used are determined at run time.
This means that if my problem size is 8 GB, it is too big to fit in the GPU buffer all at once. I want all sub-steps to be executed on the cells already buffered in GPU RAM before swapping the buffer out over the PCIe bus.

In my reading of the OpenCL spec, including the memory-map functions that have been discussed on this forum before, I don’t see a way to apply multiple kernels to only part of a memory buffer.

I would like to call clEnqueueNDRangeKernel using the entire 8 GB problem size. Even with a queue of kernels whose dependencies are properly established, it seems each kernel would have to execute over the entire 8 GB problem, requiring many PCIe bus transfers, before the next kernel could start, which in turn repeats those transfers unnecessarily.
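To make the cost concrete, here is roughly the pattern I am trying to avoid (an untested sketch; run_naive, step_kernels, chunk_cells, and the rest are placeholder names, not anything from my actual code). Every sub-step streams the whole domain through the device, so each sub-step pays the full PCIe round trip:

#include <CL/cl.h>

/* Each sub-step streams the whole 8 GB domain through the 1 GB device
 * buffer, so every sub-step repeats the full PCIe round trip. */
void run_naive(cl_command_queue queue, cl_mem dev_buf,
               cl_kernel *step_kernels, int num_steps,
               cl_float *host_data, size_t total_cells,
               size_t chunk_cells)
{
    for (int s = 0; s < num_steps; ++s) {
        for (size_t off = 0; off < total_cells; off += chunk_cells) {
            size_t n = (off + chunk_cells <= total_cells)
                           ? chunk_cells : total_cells - off;
            clEnqueueWriteBuffer(queue, dev_buf, CL_FALSE, 0,
                                 n * sizeof(cl_float),
                                 host_data + off, 0, NULL, NULL);
            clSetKernelArg(step_kernels[s], 0, sizeof(cl_mem), &dev_buf);
            clEnqueueNDRangeKernel(queue, step_kernels[s], 1, NULL,
                                   &n, NULL, 0, NULL, NULL);
            clEnqueueReadBuffer(queue, dev_buf, CL_TRUE, 0,
                                n * sizeof(cl_float),
                                host_data + off, 0, NULL, NULL);
        }
    }
}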

I have considered two possible solutions, but I am mainly wondering: have other OpenCL users encountered this issue, or can you suggest something more elegant? I’m not as familiar with CUDA, but perhaps it addresses this issue?

One would be to add more complexity to my host code, so that the problem domain is broken into smaller tiles that each fit entirely in GPU memory. Then I can execute the string of kernels on each tile, using the device queue, and presumably avoid unnecessary bus communication. The added complexity would include managing neighbor cells at the splits, duplicating them using the technique commonly called ‘ghost cells’. A rough sketch of what I have in mind is below.
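Something like this (untested; run_tiled, GHOST, tile_cells, and step_kernels are placeholder names). One point I only realized while sketching it: since every nearest-neighbor sub-step invalidates one more layer of halo cells, the ghost layer needs to be at least as deep as the number of sub-steps run before the tile is written back:

#include <CL/cl.h>

/* Halo depth per side. Each nearest-neighbor sub-step invalidates one
 * layer of halo cells, so GHOST should be >= the number of sub-steps
 * executed while the tile is resident. */
#define GHOST 4

void run_tiled(cl_command_queue queue, cl_mem dev_buf,
               cl_kernel *step_kernels, int num_steps,
               const cl_float *host_in, cl_float *host_out,
               size_t total_cells, size_t tile_cells)
{
    for (size_t off = 0; off < total_cells; off += tile_cells) {
        size_t lo = (off >= GHOST) ? off - GHOST : 0;
        size_t hi = off + tile_cells + GHOST;
        if (hi > total_cells) hi = total_cells;
        size_t n = hi - lo;

        /* One transfer in per tile, halo included. */
        clEnqueueWriteBuffer(queue, dev_buf, CL_FALSE, 0,
                             n * sizeof(cl_float),
                             host_in + lo, 0, NULL, NULL);

        /* Run every sub-step while the tile is resident; the in-order
         * queue serializes them. Real kernels would also guard the
         * tile edges, since edge cells lack some neighbors. */
        for (int s = 0; s < num_steps; ++s) {
            clSetKernelArg(step_kernels[s], 0, sizeof(cl_mem), &dev_buf);
            clEnqueueNDRangeKernel(queue, step_kernels[s], 1, NULL,
                                   &n, NULL, 0, NULL, NULL);
        }

        /* One transfer out per tile: only the interior, and into a
         * separate output array so later tiles still read unmodified
         * time-step-t halo values from host_in. Swap the two arrays
         * between time steps. */
        size_t interior = (off + tile_cells <= total_cells)
                              ? tile_cells : total_cells - off;
        clEnqueueReadBuffer(queue, dev_buf, CL_TRUE,
                            (off - lo) * sizeof(cl_float),
                            interior * sizeof(cl_float),
                            host_out + off, 0, NULL, NULL);
    }
}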

There may be another option using only one kernel function that somehow builds, at run time, a list of other OpenCL functions to be called for each work-item. I understand that function pointers are not supported in the language, but I think something could still be worked out to implement this. Then, when a work-item is executed, multiple algorithm steps could be applied to that cell’s value and its neighboring cells while still only calling clEnqueueNDRangeKernel once on the entire problem domain. I’ll admit I just thought of this while writing, so I haven’t thought it all the way through; a first sketch is below.
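The closest thing I can see to function pointers would be a switch on an integer opcode, with the host passing the run-time-chosen sequence of sub-steps as an array. An untested sketch in OpenCL C (step_diffuse, step_advect, and the opcode numbering are made-up placeholders):

/* Non-kernel helper functions standing in for the real sub-steps. */
void step_diffuse(__global float *cells, size_t i)
{
    /* placeholder: read neighbors, update cells[i] */
}

void step_advect(__global float *cells, size_t i)
{
    /* placeholder */
}

__kernel void run_steps(__global float *cells,
                        __constant int *ops,  /* sequence chosen by host */
                        int num_ops)
{
    size_t i = get_global_id(0);
    for (int s = 0; s < num_ops; ++s) {
        switch (ops[s]) {
        case 0: step_diffuse(cells, i); break;
        case 1: step_advect(cells, i);  break;
        }
        /* Caveat: barrier() only synchronizes within one work-group.
         * Neighbors in other work-groups may not have finished step s
         * before step s+1 reads them, so this scheme still needs ghost
         * cells (or double buffering) at work-group boundaries. */
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}

The host would just fill ops[] with whatever sub-step sequence it determines at run time, so only one clEnqueueNDRangeKernel call is needed per time step. The cross-work-group synchronization problem noted in the comment is what makes me unsure this can be made correct.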

Can anyone offer some advice or report having tried something else?

Thanks.

-noah