Data transfer between multiple GPUs: how to do it fast?

My application needs to be split across 2 GPUs (using a GTX 295) to double performance.
The algorithm is such that its second step can be divided into 2 independent parts, but both halves require the same set of intermediate data from the first step.

Therefore, I have 2 possibilities:

  1. Compute the whole intermediate data set redundantly on both GPUs. As a result, performance will not scale by a factor of 2 with 2 GPUs.
  2. Split step 1 between the 2 GPUs and let them exchange the missing chunks of intermediate data with each other.

I would like to achieve performance scaling, so I need to go with option 2.

However, the amount of data to exchange is significant, so if I use pageable memory for the exchange I will pay a big performance penalty (no async data transfer).

I really need to use page-locked memory and async data transfer.
I use the runtime API with 2 pthreads, each thread driving 1 GPU.

If I understand correctly, page-locked pointers allocated within a thread are pinned to a specific GPU. So in order to exchange data between GPUs I would have to run a separate thread that copies data between the page-locked buffers belonging to the 2 GPUs. This is possible in principle, but adds substantial complexity to the implementation.

Am I missing a “right” way to exchange data between GPUs using async data transfer?
I read in a forum post from a year ago that future versions of CUDA might allow page-locked pointers to be shared between GPUs. Has that been implemented?
Is there any special possibility for the GTX 295 (which has 2 GPUs in the same box!)?

Any suggestion would be appreciated…


Use portable pinned memory. cudaHostAlloc, I think, is the function you want.

See this passage in the 3.0 Programming Guide:
“A block of page-locked memory can be used by any host threads, but by default, the
benefits of using page-locked memory described above are only available for the
thread that allocates it. To make these advantages available to all threads, it needs to
be allocated by passing flag cudaHostAllocPortable to cudaHostAlloc().”

So you shouldn’t have a problem.
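
For what it's worth, here is a minimal sketch of how the exchange could look with one portable pinned staging buffer shared by both host threads. It is only an illustration under some assumptions: two devices numbered 0 and 1, a toolkit recent enough to provide cudaMemsetAsync, and a memset standing in for the real step 1 kernel. The names HALF, h_stage, worker and CHECK are made up for the example and do not come from the original poster's code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <pthread.h>
#include <cuda_runtime.h>

// Abort with a readable message if a runtime API call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(1);                                                  \
        }                                                             \
    } while (0)

static const size_t HALF = 1 << 20;   // bytes produced per GPU (hypothetical size)
static unsigned char *h_stage;        // portable pinned staging buffer, 2 * HALF bytes
static pthread_barrier_t barrier;     // both halves must be staged before anyone reads

static void *worker(void *arg) {
    int tid = (int)(size_t)arg;       // 0 or 1; also used as the device index
    int other = 1 - tid;

    CHECK(cudaSetDevice(tid));
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Each GPU holds the full intermediate set but computes only its own half.
    unsigned char *d_data;
    CHECK(cudaMalloc((void **)&d_data, 2 * HALF));

    // Stand-in for step 1: fill this GPU's half with a recognizable value.
    CHECK(cudaMemsetAsync(d_data + tid * HALF, tid + 1, HALF, stream));

    // Stage this GPU's half in pinned host memory (async device-to-host copy).
    CHECK(cudaMemcpyAsync(h_stage + tid * HALF, d_data + tid * HALF, HALF,
                          cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));

    // Wait until the other thread has staged its half as well.
    pthread_barrier_wait(&barrier);

    // Fetch the missing half from the staging buffer (async host-to-device copy).
    CHECK(cudaMemcpyAsync(d_data + other * HALF, h_stage + other * HALF, HALF,
                          cudaMemcpyHostToDevice, stream));
    CHECK(cudaStreamSynchronize(stream));

    // Spot check: the first byte of the received half should equal other + 1.
    unsigned char probe = 0;
    CHECK(cudaMemcpy(&probe, d_data + other * HALF, 1, cudaMemcpyDeviceToHost));
    printf("GPU %d received byte value %d from GPU %d\n", tid, (int)probe, other);

    CHECK(cudaFree(d_data));
    CHECK(cudaStreamDestroy(stream));
    return NULL;
}

int main() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) {
        printf("This sketch needs 2 GPUs, found %d\n", count);
        return 0;
    }

    // The portable flag is what makes the pinned buffer usable from both threads.
    CHECK(cudaHostAlloc((void **)&h_stage, 2 * HALF, cudaHostAllocPortable));
    pthread_barrier_init(&barrier, NULL, 2);

    pthread_t threads[2];
    for (size_t i = 0; i < 2; ++i)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; ++i)
        pthread_join(threads[i], NULL);

    pthread_barrier_destroy(&barrier);
    CHECK(cudaFreeHost(h_stage));
    return 0;
}
```

In this layout no extra copy thread is needed: each GPU thread writes its own half of the staging buffer and reads the other half, and the barrier is the only synchronization between them. In a real application the two cudaStreamSynchronize calls could be moved later so that the copies overlap with independent kernel work on the same device.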

More on pinned memory there:

Thanks, this solves it.

The problem is that we’re using CUDA 2.0, so we need to switch to the latest release.