I’m working on parallelizing a computation so the work is split across multiple GPUs. I have several questions about how to do this, in particular about how to transfer data between devices most efficiently.
First, should I use a single cl_context with multiple devices, or should I use a separate context for each device?
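To make the question concrete, here’s a sketch of the two setups I’m comparing. The platform/device query is abbreviated, and I’m assuming two GPUs on the same platform; error checking is omitted:

```c
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id gpu_ids[2];
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 2, gpu_ids, NULL);

    /* Option A: a single context spanning both GPUs */
    cl_context shared_ctx = clCreateContext(NULL, 2, gpu_ids, NULL, NULL, &err);

    /* Option B: a separate context per GPU */
    cl_context ctx0 = clCreateContext(NULL, 1, &gpu_ids[0], NULL, NULL, &err);
    cl_context ctx1 = clCreateContext(NULL, 1, &gpu_ids[1], NULL, NULL, &err);

    return 0;
}
```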
If I use one context with multiple devices, what happens when I create a buffer object? Most of my data is specific to a particular device, not shared. Does creating a buffer in a multi-device context actually allocate memory on every one of those devices? Is there a way to tell it to allocate memory on only one device?
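For reference, this is the call I mean (`shared_ctx` from the sketch above, `nbytes` being whatever size I need). There’s no device argument at all, which is what prompts the question:

```c
/* No per-device parameter here: the buffer is created for the whole
   context, even though only one device will ever touch it. */
cl_mem buf = clCreateBuffer(shared_ctx, CL_MEM_READ_WRITE, nbytes, NULL, &err);
```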
Currently, I’m transferring data by downloading it to host memory, then uploading it to each device. It looks like CUDA 4.0 may support direct device-to-device (peer-to-peer) transfers, but that feature doesn’t appear to exist in OpenCL. Is that correct?
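Concretely, my current device-to-device path stages through the host like this. I’m assuming `q0` and `q1` are command queues on the two devices, `buf_dev0` and `buf_dev1` are the source and destination buffers, and `host_ptr` is ordinary pageable host memory; blocking calls are used for simplicity:

```c
/* Stage the data through host memory: device 0 -> host -> device 1. */
err = clEnqueueReadBuffer(q0, buf_dev0, CL_TRUE, 0, nbytes, host_ptr,
                          0, NULL, NULL);
err = clEnqueueWriteBuffer(q1, buf_dev1, CL_TRUE, 0, nbytes, host_ptr,
                           0, NULL, NULL);
```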
The OpenCL Programming Guide suggests that performance may be better if the transfers go through page-locked host memory. It also says you can’t allocate page-locked memory directly, but that a buffer created with the CL_MEM_ALLOC_HOST_PTR flag is likely to be backed by page-locked memory. The catch is that creating a buffer requires a cl_context. If I use a separate context for each device, how can I allocate page-locked memory that can be used for downloading data from one device and then uploading it to a different device?
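For the single-context case, my understanding of the recipe is something like the following; this is my best guess at what the guide intends, and I haven’t verified that the memory actually ends up pinned:

```c
/* Create a buffer whose backing store the implementation is likely
   (but not guaranteed) to place in page-locked host memory, then map
   it to get a host pointer to use as the staging area. */
cl_mem pinned = clCreateBuffer(shared_ctx, CL_MEM_ALLOC_HOST_PTR,
                               nbytes, NULL, &err);
void *staging = clEnqueueMapBuffer(q0, pinned, CL_TRUE,
                                   CL_MAP_READ | CL_MAP_WRITE,
                                   0, nbytes, 0, NULL, NULL, &err);

/* Use staging as the host pointer for the read/write pair above... */
err = clEnqueueReadBuffer(q0, buf_dev0, CL_TRUE, 0, nbytes, staging,
                          0, NULL, NULL);
err = clEnqueueWriteBuffer(q1, buf_dev1, CL_TRUE, 0, nbytes, staging,
                           0, NULL, NULL);

/* ...and unmap when the transfers are done. */
err = clEnqueueUnmapMemObject(q0, pinned, staging, 0, NULL, NULL);
```

With separate per-device contexts, `pinned` belongs to one context only, so I don’t see how the other device’s queue could use `staging` for the upload. That’s the crux of my question.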