Transferring data between devices

I’m working on parallelizing a computation to split the work between multiple GPUs. I have several questions about how to do this, and in particular how to most efficiently transfer data between devices.

First, should I use a single cl_context with multiple devices, or should I use a separate context for each device?

If I use multiple devices with one context, what happens when I create a buffer object? Most of my data is specific to a particular device, not shared. When I create a buffer for a context with multiple devices, does it actually allocate memory on every one of the devices? How can I tell it to only allocate memory on one device?

Currently, I’m transferring data by downloading it to host memory, then uploading it to each device. It looks like CUDA 4.0 may make direct device-to-device transfers possible, but that doesn’t appear to be available in OpenCL. Is that correct?

The OpenCL Programming Guide suggests the performance may be better if I do the transfers to and from page-locked memory. It also says you can’t allocate page-locked memory directly, but if you create a buffer with the CL_MEM_ALLOC_HOST_PTR flag, that will probably use page-locked memory. But I have to specify a cl_context when I create the buffer. If I use a separate context for each device, how can I allocate page-locked memory that can be used for downloading data from one device then uploading it to a different device?

Peter

Hi Peter,

Answers are inlined:

I would recommend a single context with multiple devices. The reason is that you can share resources like buffers and events, which lets you synchronize and coordinate execution across all devices of a single context.

In OpenCL, buffer objects are associated with the context, not with a device as is the case with CUDA. The NVIDIA driver uses a lazy allocation strategy, meaning that memory on a device is allocated when the buffer is first used with it (e.g. as a kernel argument or in a read/write command). If a buffer is only used with one device, it will only be allocated on that device. However, other vendors' strategies may differ here.
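For illustration, here is a rough sketch of that setup (my own example code, error handling omitted): one context spanning the GPUs of the first platform, one command queue per device, and a buffer created on the context.

```c
#include <CL/cl.h>

int main(void)
{
    cl_platform_id   platform;
    cl_device_id     devices[2];
    cl_uint          num_devices;
    cl_int           err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 2, devices, &num_devices);

    /* One context spanning all (here: up to two) GPUs. */
    cl_context ctx = clCreateContext(NULL, num_devices, devices,
                                     NULL, NULL, &err);

    /* One command queue per device -- the queue, not the buffer, is what
       binds work to a particular device. */
    cl_command_queue queues[2];
    for (cl_uint i = 0; i < num_devices; ++i)
        queues[i] = clCreateCommandQueue(ctx, devices[i], 0, &err);

    /* The buffer belongs to the context, not to a device.  As described
       above, the NVIDIA driver allocates device memory lazily, the first
       time the buffer is used on a given device. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 1 << 20, NULL, &err);

    /* ... enqueue work that uses buf on queues[0] only; with lazy
       allocation it should then only be backed by memory on device 0
       (NVIDIA-specific behaviour). */

    clReleaseMemObject(buf);
    for (cl_uint i = 0; i < num_devices; ++i)
        clReleaseCommandQueue(queues[i]);
    clReleaseContext(ctx);
    return 0;
}
```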

Yes, CUDA 4.0 allows direct device-to-device transfers on certain configurations. This is not available in OpenCL, correct.

Buffers are associated with contexts, not devices. So for contexts with multiple devices, the same buffer is available to be used on all devices without explicit copying. Under the hood the driver will take care of allocation and copying between devices.

-Timo

Interesting. What actually happens when a kernel writes to a shared buffer, then a different kernel on a different device reads from it? Will the data be transferred in a more efficient way than if I manually download it to host memory, then upload it again? Is that the most efficient way of getting the data from one device to another?

Peter

I think that is the standard behavior rather than just one vendor's strategy.

That will work on 64-bit setups NOT using WDDM, which means it will not work under Windows Vista/7 unless the card is running in TCC mode. And yes, it does not work under OpenCL.

AFAIK, that will not work. OpenCL does not synchronize buffers between devices. You have to manually copy data to a CPU buffer and then to the next OpenCL device. I’m not even sure the behavior of actually using the buffer on multiple devices is fully defined.
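In other words, something along these lines (untested sketch, hypothetical names; bufA/queueA belong to device A's context, bufB/queueB to device B's):

```c
#include <CL/cl.h>

/* Hypothetical helper: copy `size` bytes from bufA (device A) to bufB
   (device B) by staging through a host memory block. */
static cl_int copy_via_host(cl_command_queue queueA, cl_mem bufA,
                            cl_command_queue queueB, cl_mem bufB,
                            void *host_staging, size_t size)
{
    cl_int err;

    /* Blocking read: when it returns, host_staging holds device A's data. */
    err = clEnqueueReadBuffer(queueA, bufA, CL_TRUE, 0, size,
                              host_staging, 0, NULL, NULL);
    if (err != CL_SUCCESS) return err;

    /* Blocking write of the staged data to device B. */
    return clEnqueueWriteBuffer(queueB, bufB, CL_TRUE, 0, size,
                                host_staging, 0, NULL, NULL);
}
```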

By the way, page-locked memory is also context-specific under CUDA (at least pre-4.0). You can allocate it so that it is recognized as such under all contexts (not sure what the behavior is with OpenCL), but once the context is destroyed, the page-locked memory is released as well. You are probably safer using a single context, though, so you are not implementation-dependent.

Yes, this works. It is also fully valid according to the OpenCL spec, see Appendix A.1.

The application needs to make sure that shared resources are not used simultaneously, by means of OpenCL events or clFlush/clFinish; otherwise the outcome is undefined.
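For example (untested sketch, assuming a buffer buf shared between two queues of the same context): hand the event of the producing command to the wait list of the consuming command, and flush the producing queue so the command is actually submitted.

```c
#include <CL/cl.h>

/* Sketch: serialize use of a shared buffer between two queues of the same
   context with an event, so the two commands never touch buf at the same
   time. */
static void produce_then_consume(cl_command_queue producer_q,
                                 cl_command_queue consumer_q,
                                 cl_mem buf, size_t size,
                                 const void *src, void *dst)
{
    cl_event produced;

    /* Device 0 fills the buffer (a write here; a kernel enqueue works the
       same way, it just returns its own event). */
    clEnqueueWriteBuffer(producer_q, buf, CL_FALSE, 0, size, src,
                         0, NULL, &produced);
    clFlush(producer_q);   /* make sure the command is actually submitted */

    /* Device 1 only starts reading once `produced` has completed. */
    clEnqueueReadBuffer(consumer_q, buf, CL_TRUE, 0, size, dst,
                        1, &produced, NULL);

    clReleaseEvent(produced);
}
```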

That appendix talks about sharing objects (including memory objects) between command queues, not devices. It says nothing about the data coherency of a memory object shared between multiple devices. A memory object can be used by multiple queues (useful for concurrent copy and execute on parts of an image, for example), but AFAIK there is no implicit coherency between multiple devices. This has to be done manually by the host via host memory. There are just too many devices that can’t support such functionality, and the potential performance hits are huge. I doubt that the virtual memory unit on the GPU is complex enough to mark dirty pages and copy only those, and I doubt even more that DSPs have anything remotely like that. Even on the CPU that functionality is pretty limited. This means that you will need to perform a full memory copy each time. clEnqueueCopyBuffer is command-queue specific as well, so it can’t operate between devices.

I’m not even sure it’s defined whether a memory object can be instantiated on multiple devices.

With some off-line help from Timo, I came up with a strategy that works pretty well:

  1. Use a separate context for every device.

  2. Create a buffer with CL_MEM_ALLOC_HOST_PTR so that it will use pinned memory.

  3. Call clEnqueueMapBuffer() to get a reference to that pinned memory. Do this only once, and don’t unmap it until you’re all done with it.

  4. Transfer data to and from that pinned memory with clEnqueueReadBuffer() and clEnqueueWriteBuffer().

Note that the pinned memory buffer is just being used as a block of host memory, and is referenced by pointer, not through a cl_mem object. That means we can use it with multiple devices, even though they’re using different contexts.
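In code, the approach looks roughly like this (a sketch with made-up names, no error handling; each device has its own context and queue as in step 1):

```c
#include <CL/cl.h>

#define STAGING_SIZE (16 * 1024 * 1024)

/* Per-device state: each device gets its own context and queue (step 1). */
typedef struct {
    cl_context       ctx;
    cl_command_queue queue;
    cl_mem           data;      /* device-resident working buffer */
} Device;

/* Steps 2-3: allocate a pinned staging buffer on one of the contexts and
   map it once.  The mapped pointer is then treated as plain host memory
   and is never unmapped until the program is done with it. */
static void *create_pinned_staging(Device *dev, cl_mem *staging_buf)
{
    cl_int err;
    *staging_buf = clCreateBuffer(dev->ctx, CL_MEM_ALLOC_HOST_PTR,
                                  STAGING_SIZE, NULL, &err);
    return clEnqueueMapBuffer(dev->queue, *staging_buf, CL_TRUE,
                              CL_MAP_READ | CL_MAP_WRITE,
                              0, STAGING_SIZE, 0, NULL, NULL, &err);
}

/* Step 4: move data from one device to another through the pinned pointer.
   The pinned memory is passed as an ordinary host pointer, not as a cl_mem,
   so it can be used with both contexts. */
static void transfer(Device *src, Device *dst, void *pinned, size_t size)
{
    clEnqueueReadBuffer(src->queue, src->data, CL_TRUE, 0, size,
                        pinned, 0, NULL, NULL);
    clEnqueueWriteBuffer(dst->queue, dst->data, CL_TRUE, 0, size,
                         pinned, 0, NULL, NULL);
}
```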

The details matter a lot: small changes can make a big difference to performance. For example, using clEnqueueCopyBuffer() instead of clEnqueueReadBuffer() and clEnqueueWriteBuffer() is much slower. According to Timo, there’s no good reason for it to be slower, and this problem will be fixed in a future driver update. Also, if I use a single context for all devices, it becomes much slower; that is probably also a driver bug. Timo doesn’t see this problem on his system, but it makes a big difference on mine (a Tesla S1070).

Peter

It seems to me that if the appendix refers to “sharing objects between command queues”, we may assume that the command queues may be assigned to different devices. I mean, if there were such a limitation, it should be stated right in that appendix.

With respect to “there are just too many devices that can’t support such functionality”, I think such devices would simply not be OpenCL compliant.

Kind regards