Inter-GPU communication

I am trying to communicate from GPU to GPU with OpenCL. I have tried mapping a buffer on device 1 and then passing the pointer to clEnqueueWriteBuffer on device 2;
I have tried clEnqueueCopyBuffer() directly between GPU 1 and GPU 2 (in the same context); and I have also tried
clEnqueueCopyBuffer (device 1) ----> Host ----> clEnqueueCopyBuffer (device 2). The scheme is N kernel executions + transfer, then again N executions + transfer;
each transfer is 4 blocks of 300 KB. Now, if I transfer (Device 1) -----> Host -----> (Device 1), I get good results, about 2 GB/s, but if I try
(Device 1) -----> Host -----> (Device 2), I get 50 MB/s of bandwidth on two Tesla C1060s (and no way to overcome the problem with the other methods).

I can’t find any documentation on how to transfer from GPU to GPU in OpenCL. Can someone who has tried this, or someone from NVIDIA, explain or point to documentation on how to transfer
from GPU to GPU efficiently in OpenCL?

I am looking for the same answer, but cannot find it anywhere. My application needs to split up a large dataset, which does not fit on a single device, among multiple devices and ghost layers have to be communicated between the devices for each iteration. Which command should be used to copy parts of a buffer on one device to some part of another device’s buffer?

The OpenCL specs are indeed not very clear or detailed in this respect, and I have to admit that I never experimented with shared memory objects. However, Appendix A.1 (in the latest 1.0 and 1.1 specs) says it is fine to share memory objects between command queues (which IMHO implies multiple devices), but one has to ensure appropriate synchronization.

One should differentiate between CUDA device memory and OpenCL memory objects; the latter is a much more abstract concept. Somewhere under the hood, an OpenCL runtime will probably be required to copy shared memory objects to maintain the semantics, but I cannot tell how efficiently this is done, both in terms of “technical efficiency” (e.g., asynchronous or direct GPU-GPU transfers for newer cards) and “logical efficiency” (reducing the number of transfers to the necessary minimum).

Regarding atlruds’ question: the specs (here 1.0) are quite clear that two command queues must not modify a single memory object at the same time (only concurrent reading seems to be fine).

As long as you are using OpenCL 1.0, you will have to use separate memory objects to be processed on different GPUs concurrently. I cannot tell whether it works nicely if you put all of them into a single context (which you will need for the memory objects used to exchange data between the GPUs), or whether it is better to keep them in separate contexts and exchange explicitly through host memory. In the end, it’s quite cumbersome and similar to MPI programming, I would say. The sub-buffers of OpenCL 1.1 would simplify that a lot, but I would not bet on NVIDIA ever releasing it.

See this thread for some hopefully helpful suggestions: