memory sharing in a multi-gpu environment

o.stava · January 25, 2010, 4:21pm

If I understand OpenCL correctly, when I create a single context with multiple devices, the memory is allocated on all devices and its content is shared between them. But is there any way to make the content of the memory non-shared?

Imagine the following case:

Input is some 3D data = memory object A = shared between devices
Each GPU launches a kernel that is supposed to process one 2D slice from the memory A
Now, each kernel needs to store some intermediate results (2D data) in a temporary memory B
Each GPU launches a second kernel that processes data from the memory B and stores the results back to the memory A

Now the problem is that I really don’t want the memory B to be shared between the devices, as it only stores local data that is specific to each device. Also, if the memory B is shared, it would have to be allocated as a 3D array whose height is the number of devices, and there would be some synchronization overhead even when the synchronization is not really needed.
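
For illustration, here is a minimal sketch of the index arithmetic behind the single-buffer alternative described above, where the temporary memory B stacks one 2D slice per device. It is plain C, not OpenCL host code, and it assumes a row-major layout with fixed-size elements. The function names (`slice_elems`, `scratch_offset`, `combined_scratch_bytes`, `per_device_scratch_bytes`) are hypothetical helpers made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of elements in one 2D slice. */
size_t slice_elems(size_t width, size_t height)
{
    return width * height;
}

/* Element offset of the slice owned by device `dev` inside one combined
 * buffer B that stacks one 2D slice per device (the "3D array whose
 * height is the number of devices"). */
size_t scratch_offset(size_t dev, size_t num_devices,
                      size_t width, size_t height)
{
    assert(dev < num_devices); /* each device owns exactly one slice */
    return dev * slice_elems(width, height);
}

/* Size in bytes of the combined buffer B covering all devices. */
size_t combined_scratch_bytes(size_t num_devices, size_t width,
                              size_t height, size_t elem_size)
{
    return num_devices * slice_elems(width, height) * elem_size;
}

/* Size in bytes of B if each device had its own private 2D buffer. */
size_t per_device_scratch_bytes(size_t width, size_t height,
                                size_t elem_size)
{
    return slice_elems(width, height) * elem_size;
}
```

With 4 devices and 512 x 512 slices of 4-byte elements, the combined buffer would take 4 MiB in total, whereas a private buffer would take 1 MiB per device. Whether the combined buffer ends up resident on every device depends on the implementation, which is exactly the open question here.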

Is this correct? I’m pretty new to OpenCL so maybe some of my assumptions are wrong.
If it is correct, how would you solve this problem? Using the 3D array for temporary results or using multiple contexts? (or using CUDA :) )

Thanks

_Big_Mac · January 25, 2010, 4:47pm

The content of such memory is not shared. There’s currently no easy and fast way to share GPU memory - you need to explicitly copy it back to host, sync on host, and copy back to device.

Allocating a buffer on a context that has many devices will simply allocate a copy of the buffer on each device separately. There’s no implicit sharing.

EDIT: Has been proven wrong, don’t quote ;)

o.stava · January 25, 2010, 5:36pm

Thanks,

I read about the sharing in some PowerPoint presentation, and it was clearly wrong.

I tried to find a multi-GPU demo in the OpenCL SDK and found a project called oclSimpleMultiGPU. But after examining it I’m probably even more confused than before.

You said that when a buffer is created it is allocated on all devices in the context. But in that project they actually create the buffer separately for each device of the context (to be more precise, they call clCreateBuffer N times, where N is the number of devices). So this example suggests that the buffer is not allocated on all devices but only on a single one, which is kind of strange since clCreateBuffer does not take any device ID as an input. So maybe the memory is allocated only when it is needed by the device?

Unfortunately, the official Khronos OpenCL documentation does not seem to address this issue at all. There is a section about sharing OpenCL objects (Appendix A.1), but it only provides some suggestions and does not explain the memory model at all.

_Big_Mac · January 25, 2010, 5:42pm

This example from the SDK is indeed odd. I have no idea why that would work since, as you noticed, device_id isn’t specified anywhere near clCreateBuffer.

I’ve been wondering about this earlier and asked the same question on the official NVIDIA forums: http://forums.nvidia.com/index.php?showtopic=153708

Do you have multiple devices that you could test it with? It would be easy to modify the SDK example to use a single buffer allocated for the whole context and see if that works.

o.stava · January 25, 2010, 5:54pm

Unfortunately not right now. We plan to use multiple GPUs in our application but we are still waiting for the next generation of GPUs before we invest any money.

o.stava · January 25, 2010, 10:20pm

I googled a little bit and I have found some new interesting info:

The mentioned presentation that states that the memory objects are shared between devices:

http://gpgpu.org/wp/wp-content/uploads/200…-OpenCL-API.pdf

Actually at slide #14 they explicitly state that the memory IS copied between devices.

The same is also discussed in this thread:

http://www.khronos.org/message_boards/viewtopic.php?f=37&t=2133

It is also suggested that the memory is allocated only when the particular device actually needs the memory.

And even Appendix A.1 of the official Khronos OpenCL specs indicates that the memory is indeed shared. I just wish they would describe the whole multi-device behavior more clearly.

All of this is consistent with the sample project in NVIDIA’s SDK: they create one buffer for each device (which would make no sense if the buffers were not shared), and if the memory is allocated only when needed, then each device allocates only what it actually uses.

Now I just wish someone from NVIDIA could confirm that it really works in this manner.

_Big_Mac · January 25, 2010, 10:34pm

Now that’s news to me. Thank you for the sources.

Dr.Synth · April 4, 2010, 11:05am

External Media
