My application needs to be split across 2 GPUs (on a GTX 295) to double performance.
The specifics of the algorithm are such that its second step can be divided into 2 independent parts, but both halves require the same set of intermediate data from the first step.
Therefore, I have 2 possibilities:
- Perform redundant computation of the intermediate data set on both GPUs. As a result, performance will not scale by a factor of 2 with 2 GPUs.
- Split step 1 between the 2 GPUs and let them exchange the missing chunks of intermediate data with each other.
I would like to achieve performance scaling, so I need to follow the second possibility.
However, the amount of data to exchange is significant, so if I use pageable memory pointers for the exchange, I will pay a big performance penalty (no async data transfer).
I really need to use page-locked memory and async data transfer.
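For clarity, this is the kind of transfer pattern I mean (a minimal sketch; `chunk_bytes` and the buffer names are placeholders, not my real code):

```c
#include <cuda_runtime.h>

int main(void)
{
    const size_t chunk_bytes = 1 << 20;  /* placeholder size */
    float *h_staging = NULL;             /* page-locked host buffer */
    float *d_dst     = NULL;             /* device buffer */

    cudaMalloc((void **)&d_dst, chunk_bytes);
    /* cudaMallocHost returns page-locked memory; only then is
       cudaMemcpyAsync truly asynchronous. With pageable memory the
       call silently degrades to a synchronous copy. */
    cudaMallocHost((void **)&h_staging, chunk_bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d_dst, h_staging, chunk_bytes,
                    cudaMemcpyHostToDevice, stream);
    /* ... other work could overlap with the copy here ... */
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_staging);
    cudaFree(d_dst);
    return 0;
}
```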
I use the runtime API, with 2 pthreads, each thread driving 1 GPU.
If I understand correctly, page-locked pointers allocated within a thread are pinned only for that thread's CUDA context, i.e. for a specific GPU. So in order to exchange data between GPUs I would have to run a separate thread that shuttles data between page-locked pointers belonging to the 2 contexts. This is possible in principle but adds substantial complexity to the implementation.
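To make the setup concrete, here is roughly what each of my worker threads does, and where the problem arises (a simplified sketch; `CHUNK_BYTES`, the buffer names, and the hand-off between threads are placeholders):

```c
#include <cuda_runtime.h>
#include <pthread.h>

#define CHUNK_BYTES (1 << 20)   /* placeholder size */

static void *worker(void *arg)
{
    int dev = *(int *)arg;      /* 0 or 1 */
    cudaSetDevice(dev);         /* bind this thread to one GPU */

    float *d_intermediate = NULL, *h_pinned = NULL;
    cudaMalloc((void **)&d_intermediate, CHUNK_BYTES);
    /* As far as I understand, this buffer is page-locked only for the
       context of THIS thread; the other thread's context would treat
       the same pointer as pageable. That is exactly the limitation
       I am asking about. */
    cudaMallocHost((void **)&h_pinned, CHUNK_BYTES);

    cudaStream_t s;
    cudaStreamCreate(&s);

    /* Step 1: compute this GPU's share of the intermediate data ... */

    /* Stage the result to host memory asynchronously. */
    cudaMemcpyAsync(h_pinned, d_intermediate, CHUNK_BYTES,
                    cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);

    /* Hand-off to the other thread would happen here (mutex/condvar,
       not shown). The problem: copying h_pinned into the OTHER GPU
       asynchronously requires a pointer that is page-locked for the
       other context as well. */

    cudaStreamDestroy(s);
    cudaFreeHost(h_pinned);
    cudaFree(d_intermediate);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int ids[2] = {0, 1};
    for (int i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);
    return 0;
}
```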
Am I missing a “right” way to exchange data between GPUs while keeping async data transfers?
I read in a forum post from about a year ago that future versions of CUDA might allow page-locked pointers to be shared between GPUs. Has that been implemented?
Is there any special possibility for the GTX 295 (which has 2 GPUs on the same board!)?
Any suggestions would be appreciated…
Thanks