Memory sharing in a multi-GPU environment

If I understand OpenCL correctly, then when I create a single context with multiple devices, memory is allocated on all devices and its contents are shared between them. Is there any way to make the contents of a memory object non-shared?

Imagine the following case:

  1. Input is some 3D data = memory object A = shared between devices
  2. Each GPU launches a kernel that is supposed to process one 2D slice from the memory A
  3. Now, each kernel needs to store some intermediate results (2D data) in a temporary memory B
  4. Each GPU launches a second kernel that processes data from the memory B and stores the results back to the memory A

The problem is that I really don’t want memory B to be shared between the devices, since it only holds local data that is specific to each device. Also, if memory B were shared, it would have to be allocated as a 3D array whose height is the number of devices, and there would be some synchronization overhead even though no synchronization is actually needed.
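To make the layout question concrete, here is a small plain-C sketch of the byte offsets involved. It assumes the slices of A are stored back to back and that a shared B would stack one 2D slab per device. The function names are made up for illustration and are not part of OpenCL.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one 2D slice of width x height elements. */
size_t slice_bytes(size_t width, size_t height, size_t elem_size)
{
    return width * height * elem_size;
}

/* Byte offset of slice z inside the 3D input A
   (assumption: slices are stored contiguously, one after another). */
size_t input_slice_offset(size_t z, size_t width, size_t height,
                          size_t elem_size)
{
    return z * slice_bytes(width, height, elem_size);
}

/* Total size of a shared B laid out as a 3D array whose third
   dimension is the number of devices. */
size_t shared_scratch_bytes(size_t num_devices, size_t width, size_t height,
                            size_t elem_size)
{
    return num_devices * slice_bytes(width, height, elem_size);
}

/* Byte offset of one device's slab inside that shared B. */
size_t shared_scratch_offset(size_t device_index, size_t num_devices,
                             size_t width, size_t height, size_t elem_size)
{
    assert(device_index < num_devices);
    return device_index * slice_bytes(width, height, elem_size);
}
```

For 512 x 256 float slices and two devices, a shared B would take 1 MiB, with device 1 starting at byte 524288. A per-device B would be a single 512 KiB slab that each device addresses from offset 0.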

Is this correct? I’m pretty new to OpenCL so maybe some of my assumptions are wrong.
If it is correct, how would you solve this problem? Using the 3D array for temporary results or using multiple contexts? (or using CUDA :) )


The content of such memory is not shared. There’s currently no easy and fast way to share GPU memory - you need to explicitly copy it back to host, sync on host, and copy back to device.

Allocating a buffer on a context that has many devices will simply allocate a copy of the buffer on each device separately. There’s no implicit sharing.

EDIT: Has been proven wrong, don’t quote ;)


I read about the sharing in a PowerPoint presentation, and it was clearly wrong.

I tried to find a multi-GPU demo in the OpenCL SDK and found a project called oclSimpleMultiGPU. But after examining it I’m probably even more confused than before.

You said that when a buffer is created, it is allocated on all devices in the context. But the project actually creates a buffer for each device of the context separately (more precisely, it calls clCreateBuffer N times, where N is the number of devices). This suggests that a buffer is allocated on a single device rather than on all of them, which is strange because clCreateBuffer does not take a device ID as input. So maybe the memory is allocated only once a device actually needs it?

Unfortunately, the official Khronos OpenCL documentation does not seem to address this issue at all. There is a section about sharing OpenCL objects (Appendix A.1), but it only offers some suggestions and does not explain the memory model.

That SDK example is indeed odd. I have no idea why it would work since, as you noticed, no device ID is specified anywhere near clCreateBuffer.

I’ve been wondering about this earlier and I’ve asked this question here

Do you have multiple devices that you could test it with? It would be easy to modify the SDK example to use a single buffer allocated for the whole context and see if that works.

Unfortunately not right now. We plan to use multiple GPUs in our application but we are still waiting for the next generation of GPUs before we invest any money.

I googled a little bit and I have found some new interesting info:

  1. The mentioned presentation that states that the memory objects are shared between devices:…-OpenCL-API.pdf

Actually at slide #14 they explicitly state that the memory IS copied between devices.

  2. The same is also discussed in this thread:…f=37&t=2133

It is also suggested that the memory is allocated only when the particular device actually needs the memory.

Even Appendix A.1 of the official Khronos OpenCL spec indicates that the memory is indeed shared. I just wish they described the whole multi-device behavior more clearly.

All of this is consistent with the sample project in NVIDIA’s SDK: they create one buffer per device (which would make no sense if buffers were not shared), and if memory is allocated only when needed, then each device allocates only what it uses.

Now I just wish someone from NVIDIA could confirm that it really works in this manner.

Now that’s news to me. Thank you for the sources.