multi-GPUs with streams. Seems only one device overlapping copies

Windows 7 x64 with 2 GTX Titan X GPUs, one using TCC driver and one in WDDM connected to the display.

Splitting up a problem over two GPUs using streams to copy over the input data and launch batches of kernels.

The problem is that when I profile via NVVP it appears that only one of the two GPUs are overlapping the host-to-device copies with kernel launches.

This image set shows the issue ( NOTE: GPU 1 is the top image (WDDM connected to display) and GPU 0 is TCC);

GPU 0 has 16 streams mapped to it and GPU 1 has 6 streams.
Each each does a host-to-device copy per stream(from pinned host memory to device)
a pre-processing kernel per stream
and a batch of large kernels per stream (4 to 10)

GPU 0 is 8% faster than GPU 1 so I give GPU 0 a bit more work.
Have adjusted the number of streams for GPU 1 from 6 to 20, which did not resolve this issue.

Also this is not a situation where I can take advantage of the 2 copy engines, because the output buffer for each GPU is being updated by all streams and has to wait for all updates to be done before doing a cudaMemcpyPeer() from device 1 to device 0 for the final post-processing step.

Since at least one GPU is behaving correctly, I think that I have the code implemented correctly, but maybe I am missing something.

My questions are as follows;

  1. Is there a some upper bound or limit on the number of streams the host can handle?
  2. Can only one host-to-device copy involving pinned memory be performed at one time?
  3. Do need to change any system environment variables such as CUDA_DEVICE_MAX_CONNECTIONS (which is set to 8 by default) ?
  4. Is there any reference code which gives an example of such an application?
  5. Is there any possible issue with the WDDM driver and the fact GPU 1 is connected to the display?

I could not find any examples for a similar multi-GPU sample implementation, so any advice would be appreciated.

truly not easy questions

“1) Is there a some upper bound or limit on the number of streams the host can handle?”

yes, and no.
in my mind, it depends on the method of synchronization the host implements
the guides delineate synchronous and asynchronous memory copies
i think the concept can be extended to synchronization - some synchronization mechanisms seem to be ‘dead’ synchronous, others seem to be more flexible
if the host is to stop minding everything else, and mind/ wait for a single ‘event’, it may cause repercussions, particularly when it is likely to wait long, without having forward-tended to other ‘things’ first

“2) Can only one host-to-device copy involving pinned memory be performed at one time?”

a) if you take the hardware perspective as at least a memory perspective, as well as the pci bus perspective:
from a memory perspective, if there is no overlap, i do not see why not
from a pci bus perspective, it may simply be a case of being able to secure a channel; i am not entirely certain how many unique transfers can be conducted/ managed in the same direction, simultaneously
njuffa designed the memory system and the pci bus, he should know best

“5) Is there any possible issue with the WDDM driver and the fact GPU 1 is connected to the display?”

a thought: if wddm bundles work, and releases bundled work, and if the host implements more of an asynchronous synchronization method, it should have a higher propensity to get work bundled and out the door more quickly, such that it becomes less of a concern

streams are device-local
hence, how do you synchronize between devices, if necessary, and on the host?
you seem to be utilizing cudaDeviceSynchronize(), which raises some concerns

Not sure whether this is some sort of humorous remark or meant seriously, but for the record: I have not worked on any hardware designs since my AMD Athlon processor days fifteen years ago, and that work was focused on the FPU. My knowledge of PCIe and modern x86 on-chip memory controllers is very limited.

I know that each PCIe point-to-point connection is a full duplex link, so a given device can receive data from and send data to system memory simultaneously. Obviously this also requires that the device has two DMA controllers, one each per transfer direction.

Whether the PCIe controllers of typical x86 systems allow multiple devices to transfer data to / from system memory in the same direction at the same time I do not know. Presumably this requires at minimum no shared system memory address space (on a page boundary, so including false sharing), but even that is just intelligent speculation.

“Whether the PCIe controllers of typical x86 systems allow multiple devices to transfer data to / from system memory in the same direction at the same time I do not know.”

they have buffers, don’t they; would this not then permit multiple devices to transfer at the same time?
also, don’t the pci controllers ‘inject’ the transfers as ordinary memory transactions, in a sense?
would the memory then really care which device is behind the transfer per se?
both the cpu and memory should simply see it as the pci controller wishing to push through a transaction, in my mind

CudaaduC, could you expand device 0 (context 1) (and retake the image)?

Here are two slightly better images of each GPU process via nvvp.

Again GPU 1 is the top image and GPU 0 is the bottom image;

I have tried different numbers of streams per gpu(adjusting the work per stream accordingly), and found that less streams (more work per stream) seems to result in less overlap.

Do not want to give up on this yet, as I just need to reduce the total running time by 37 ms in order to meet my objective.

@CudaaduC, are you using cudaMemcpyXXX() routines?

If so, have you considered writing your own “data mover” kernels and running them concurrently?

I found this paper illuminating: Data Transfer Matters for GPU Computing

The relevant plots are marked “IORW”.

@allanmac -> Thanks! will try to digest that information and see if it gives me any useful insight.

I am using cudaMemcpyAsync() as you mentioned.

Another interesting fact which I noticed in the nvprof output is that it seems to be splitting the PCI-E pinned bandwidth between the two GPUs;

1.56322s  6.2476ms                    -               -         -         -         -  73.125MB  11.430GB/s  GeForce GTX TIT         2        39  [CUDA memcpy HtoD]
1.56630s  29.948ms                    -               -         -         -         -  73.125MB  2.3845GB/s  GeForce GTX TIT         1        17  [CUDA memcpy HtoD]
1.56632s  617.20us           (959 16 1)       (256 1 1)        16  9.7539KB        0B         -           -  GeForce GTX TIT         1        13  compress_y_half(float const *, __half2*, int) [251]
1.56947s  6.2463ms                    -               -         -         -         -  73.125MB  11.433GB/s  GeForce GTX TIT         2        40  [CUDA memcpy HtoD]
1.57378s  29.935ms                    -               -         -         -         -  73.125MB  2.3855GB/s  GeForce GTX TIT         1        20  [CUDA memcpy HtoD]
1.57379s  615.95us           (959 16 1)       (256 1 1)        16  9.7539KB        0B         -           -  GeForce GTX TIT         1        16  compress_y_half(float const *, __half2*, int) [266]
1.57572s  6.2428ms                    -               -         -         -         -  73.125MB  11.439GB/s  GeForce GTX TIT         2        41  [CUDA memcpy HtoD]
1.57589s  29.919ms                    -               -         -         -         -  73.125MB  2.3868GB/s  GeForce GTX TIT         1        18  [CUDA memcpy HtoD]
1.57590s  616.24us           (959 16 1)       (256 1 1)        16  9.7539KB        0B         -           -  GeForce GTX TIT         1        14  compress_y_half(float const *, __half2*, int) [256]
1.57694s  29.910ms                    -               -         -         -         -  73.125MB  2.3875GB/s  GeForce GTX TIT         1        19  [CUDA memcpy HtoD]
1.57696s  616.75us           (959 16 1)       (256 1 1)        16  9.7539KB        0B         -           -  GeForce GTX TIT         1        15  compress_y_half(float const *, __half2*, int) [261]
1.58196s  6.2474ms                    -               -         -         -         -  73.125MB  11.431GB/s  GeForce GTX TIT         2        42  [CUDA memcpy HtoD]
1.58821s  6.2408ms                    -               -         -         -         -  73.125MB  11.443GB/s  GeForce GTX TIT         2        43  [CUDA memcpy HtoD]
1.59445s  6.2454ms                    -               -         -         -         -  73.125MB  11.434GB/s  GeForce GTX TIT         2        44  [CUDA memcpy HtoD]
1.59625s  29.868ms                    -               -         -         -         -  73.125MB  2.3909GB/s  GeForce GTX TIT         1        21  [CUDA memcpy HtoD]
1.59626s  617.11us           (959 16 1)       (256 1 1)        16  9.7539KB        0B         -           -  GeForce GTX TIT         1        17  compress_y_half(float const *, __half2*, int) [271]

So when it performs the host-device copies from pinned host to GPU 1 it performs them at ~11.4 GBs and they seems to serialize, while the host-device from pinned host to GPU 0 it performs them at ~2.5 GBs and they seem to overlap.

Since the max pinned host-to-device copy speed in practice is usually around 13 GBs, my guess is that that total available bandwidth is being split/shared between the two in a lopsided fashion?

Based on that nvprof log it looks like you’re already fully utilizing the PCIe bus – i.e. you’re copying a LOT of data from host to device.

For this reason I don’t think my earlier suggestion will help at all. :(

I am not sure how to interpret the detailed numbers shown above. As a general remark, the overall throughput of a PCIe link (I don’t like the term “bus” as PCIe is a point-to-point interconnect, not a multi-drop bus) increases with the size of individual transfers, up to a size of about 16 MB where it levels off. Certainly a transfer rate of 11.4 GB/sec reflects full PCIe gen3 throughput.

So it may make sense to structure transfers such that the size of individual transfers is maximized. I understand that this may conflict with copy/kernel overlap. It seems like the application is sufficiently optimized that there are multiple bottlenecks in close vicinity of each other.

Since the app is apparently within epsilon of the targeted performance: I am wondering whether power consumption and thermals are optimally controlled to allow maximum clock boosting for the entire duration of the application run.

what is the ave memory transaction (h2d copy) time compared to the kernel execution time?
at first i thought you have ‘short’ kernels, but now i am beginning to think you have lengthy transfer transactions
the profile output supports this, but a real figure would be welcome

from the profile, i get the idea that memory transactions are overly prioritized
it seems certain kernels are essentially delayed by memory transactions; if it is not by your devise, the goal would be to limit this
for device 1, no kernel commences, before all memory transactions are completed, is this essentially required/ desired?
if you consider the point at which the device is kernel-saturated, the same applies to device 0

you issue the work in a loop, not so?
i am beginning to think that you first issue memory transactions, then launch kernels
i generally find that an interleaved approach works better

for (cnt)

for (cnt)


for (cnt)

the other approach is to stream branch
have limited, dedicated transfer streams per direction per device
issue all transfers in these
have the compute streams branch from the transfer streams via events