Inter-GPU communication

I am trying to communicate from GPU to GPU with OpenCL. I have tried:
- mapping the buffer on device 1 and then passing the mapped pointer to clEnqueueWriteBuffer on device 2;
- clEnqueueCopyBuffer() directly between GPU 1 and GPU 2 (in the same context);
- clEnqueueCopyBuffer (device 1) -> host -> clEnqueueCopyBuffer (device 2).

The schema is N kernel executions + transfer, then again N executions + transfer; the transfers are 4 blocks of 300 KB each. If I transfer (Device 1) -> (Host) -> (Device 1), I get good results, about 2 GB/s, but if I try (Device 1) -> (Host) -> (Device 2), I get only 50 MB/s of bandwidth on two Tesla C1060s, and none of the other methods overcomes the problem.
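For reference, here is roughly what the staged variant looks like in my code, going through a host staging buffer with clEnqueueReadBuffer/clEnqueueWriteBuffer (queue, buffer, and size names are placeholders for my real setup):

```c
#include <stdlib.h>
#include <CL/cl.h>

/* Staged copy: device 1 -> host -> device 2.
   queue1/queue2 are the command queues of the two GPUs, src/dst the
   corresponding cl_mem buffers, and 'bytes' the block size (300 KB here). */
cl_int copy_via_host(cl_command_queue queue1, cl_command_queue queue2,
                     cl_mem src, cl_mem dst, size_t bytes)
{
    void *staging = malloc(bytes);
    if (!staging)
        return CL_OUT_OF_HOST_MEMORY;

    /* Blocking read from GPU 1 into host memory. */
    cl_int err = clEnqueueReadBuffer(queue1, src, CL_TRUE, 0, bytes,
                                     staging, 0, NULL, NULL);
    if (err == CL_SUCCESS) {
        /* Blocking write from host memory to GPU 2. */
        err = clEnqueueWriteBuffer(queue2, dst, CL_TRUE, 0, bytes,
                                   staging, 0, NULL, NULL);
    }
    free(staging);
    return err;
}
```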

I cannot find any documentation on how to transfer from GPU to GPU in OpenCL. Can someone who has tried this, or someone from NVIDIA, explain or point me to documentation on how to transfer from GPU to GPU efficiently in OpenCL?

I am looking for the same answer, but cannot find it anywhere. My application needs to split up a large dataset, which does not fit on a single device, among multiple devices and ghost layers have to be communicated between the devices for each iteration. Which command should be used to copy parts of a buffer on one device to some part of another device’s buffer?
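The closest candidate I could find in the spec is clEnqueueCopyBuffer with source and destination offsets, but I do not know whether this is the intended (or an efficient) way. A sketch of what I mean, assuming both buffers live in the same context (all names are placeholders):

```c
#include <CL/cl.h>

/* Hypothetical ghost-layer exchange: copy 'ghost_bytes' starting at
   'src_off' in device A's buffer into device B's buffer at 'dst_off'.
   Both buffers must belong to the same cl_context for this to be legal. */
cl_int exchange_ghost_layer(cl_command_queue queue, cl_mem buf_a, cl_mem buf_b,
                            size_t src_off, size_t dst_off, size_t ghost_bytes,
                            cl_event *done)
{
    return clEnqueueCopyBuffer(queue, buf_a, buf_b,
                               src_off, dst_off, ghost_bytes,
                               0, NULL, done);
}
```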

The OpenCL specs are indeed not very clear or detailed in this respect, and I have to admit that I never experimented with shared memory objects. However, appendix A1 (in the latest 1.0 and 1.1 specs) says it is fine to share memory objects between command queues (which IMHO implies multiple devices), but one has to ensure appropriate synchronization.
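To illustrate, here is a minimal, untested sketch of what such synchronization could look like: two queues in one context share a buffer, and an event makes the second device's kernel wait until the first device's kernel has finished writing (all names are placeholders):

```c
#include <CL/cl.h>

/* queue1/queue2 are command queues for two devices in the same context;
   'producer' writes a shared buffer that 'consumer' then reads. */
void ordered_access(cl_command_queue queue1, cl_command_queue queue2,
                    cl_kernel producer, cl_kernel consumer,
                    size_t global_size)
{
    cl_event produced, consumed;

    /* Device 1 writes the shared buffer. */
    clEnqueueNDRangeKernel(queue1, producer, 1, NULL, &global_size, NULL,
                           0, NULL, &produced);
    clFlush(queue1); /* make sure the command is actually submitted */

    /* Device 2 may only start once the write has completed. */
    clEnqueueNDRangeKernel(queue2, consumer, 1, NULL, &global_size, NULL,
                           1, &produced, &consumed);
    clWaitForEvents(1, &consumed);

    clReleaseEvent(produced);
    clReleaseEvent(consumed);
}
```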

One should differentiate between CUDA device memory and OpenCL memory objects; the latter is a much more abstract concept. Somewhere under the hood, an OpenCL runtime will probably be required to copy shared memory objects to maintain the semantics, but I cannot tell how efficiently this is done, both in terms of “technical efficiency” (e.g., asynchronous or direct GPU-GPU transfers for newer cards) and in terms of “logical efficiency” (reducing the number of transfers to the necessary minimum).

Regarding atlruds’ question: the specs (here 1.0) are quite clear that a single memory object must not be modified by two command queues at the same time (concurrent reading seems to be fine).

As long as you are using OpenCL 1.0, you will have to use separate memory objects to be processed on the different GPUs concurrently. I cannot tell whether it works nicely to put all of them into a single context (a single context is required at least for the memory objects you use to exchange data between the GPUs), or whether it is better to keep the contexts separate and exchange data explicitly through host memory. In the end it is quite cumbersome and similar to MPI programming, I would say. The sub-buffers of OpenCL 1.1 would simplify that a lot, but I would not bet on NVIDIA ever releasing it.
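For what it is worth, the 1.1 sub-buffer route would look roughly like this (untested sketch; 'parent', 'origin', and 'half' are placeholders, and 'origin' must respect each device's CL_DEVICE_MEM_BASE_ADDR_ALIGN):

```c
#include <CL/cl.h>

/* OpenCL 1.1 sketch: carve a per-device sub-buffer out of one big buffer
   so that each GPU works on its own region of the same memory object. */
cl_mem make_device_region(cl_mem parent, size_t origin, size_t half,
                          cl_int *err)
{
    cl_buffer_region region = { origin, half };
    return clCreateSubBuffer(parent, CL_MEM_READ_WRITE,
                             CL_BUFFER_CREATE_TYPE_REGION, &region, err);
}
```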

See this thread for some hopefully helpful suggestions:

Peter