Windows 7 x64 with two GTX Titan X GPUs, one using the TCC driver and one in WDDM mode driving the display.
I am splitting a problem across the two GPUs, using streams to copy over the input data and launch batches of kernels.
The problem is that when I profile via NVVP it appears that only one of the two GPUs is overlapping the host-to-device copies with kernel launches.
This image set shows the issue (NOTE: the top image is GPU 1, the WDDM device connected to the display, and the bottom is GPU 0, the TCC device):
GPU 0 has 16 streams mapped to it and GPU 1 has 6 streams.
Each GPU does the following per stream (see the sketch below):
- a host-to-device copy (from pinned host memory to device),
- a pre-processing kernel,
- and a batch of large kernels (4 to 10).
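For reference, here is a minimal sketch of the per-device issue order (the kernel names, buffer names, and helper function are placeholders, not my actual code):

```cpp
#include <cuda_runtime.h>
#include <vector>

// Placeholder kernels standing in for my real ones.
__global__ void preprocess(const float* in, float* work) { /* ... */ }
__global__ void bigKernel(const float* work, float* out) { /* ... */ }

// Issue order for one device; I call this once with dev=0 / 16 streams
// and once with dev=1 / 6 streams.
void issueWork(int dev, int numStreams, int batchSize,
               float** h_pinned, float** d_in, float** d_work, float* d_out,
               size_t bytesPerStream, dim3 grid, dim3 block)
{
    cudaSetDevice(dev);
    std::vector<cudaStream_t> streams(numStreams);
    for (int i = 0; i < numStreams; ++i)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < numStreams; ++i) {
        // async copy from pinned host memory, so it should overlap kernels
        cudaMemcpyAsync(d_in[i], h_pinned[i], bytesPerStream,
                        cudaMemcpyHostToDevice, streams[i]);
        // pre-processing kernel on the same stream
        preprocess<<<grid, block, 0, streams[i]>>>(d_in[i], d_work[i]);
        // batch of 4 to 10 large kernels, all updating the shared output
        for (int k = 0; k < batchSize; ++k)
            bigKernel<<<grid, block, 0, streams[i]>>>(d_work[i], d_out);
    }
}
```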
GPU 0 is 8% faster than GPU 1, so I give it a bit more work.
I have adjusted the number of streams for GPU 1 from 6 up to 20, which did not resolve the issue.
Also, this is not a situation where I can take advantage of the two copy engines: the output buffer on each GPU is updated by all streams, so all updates must finish before the final cudaMemcpyPeer() from device 1 to device 0 for the post-processing step (see the sketch below).
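The gather step looks roughly like this (again a hedged sketch; the buffer names and outBytes are placeholders):

```cpp
// Sketch of the gather step (placeholder names and sizes).
void gatherToDevice0(float* d_gather0, const float* d_out1, size_t outBytes)
{
    // every stream on both devices must finish updating its output
    // buffer before the peer copy can run
    cudaSetDevice(0);
    cudaDeviceSynchronize();
    cudaSetDevice(1);
    cudaDeviceSynchronize();

    // move GPU 1's results onto GPU 0 for the final post-processing step
    cudaMemcpyPeer(d_gather0, /*dstDevice=*/0,
                   d_out1,    /*srcDevice=*/1, outBytes);
}
```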
Since at least one GPU is behaving as expected, I believe the code is implemented correctly, but maybe I am missing something.
My questions are as follows:
- Is there some upper bound or limit on the number of streams the host can handle?
- Can only one host-to-device copy involving pinned memory be performed at one time?
- Do I need to change any system environment variables such as CUDA_DEVICE_MAX_CONNECTIONS (which defaults to 8)? (See the snippet after this list.)
- Is there any reference code which gives an example of such an application?
- Is there any possible issue with the WDDM driver and the fact that GPU 1 is connected to the display?
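Regarding the CUDA_DEVICE_MAX_CONNECTIONS question, this is how I would try raising it; as far as I understand, it has to be set before the CUDA runtime is initialized, and 32 is just an arbitrary test value:

```cpp
#include <cstdlib>

int main()
{
    // Set before any CUDA call creates a context; 32 is an arbitrary
    // test value, not a recommendation (_putenv is the Windows CRT call).
    _putenv("CUDA_DEVICE_MAX_CONNECTIONS=32");

    // ... device selection, stream setup, and the rest of the app ...
    return 0;
}
```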
I could not find a sample with a similar multi-GPU implementation, so any advice would be appreciated.