multi-GPUs with streams. Seems only one device overlapping copies

CudaaduC · October 29, 2015, 2:31am

Windows 7 x64 with 2 GTX Titan X GPUs, one using TCC driver and one in WDDM connected to the display.

Splitting up a problem over two GPUs using streams to copy over the input data and launch batches of kernels.

The problem is that when I profile via NVVP it appears that only one of the two GPUs are overlapping the host-to-device copies with kernel launches.

This image set shows the issue ( NOTE: GPU 1 is the top image (WDDM connected to display) and GPU 0 is TCC);

[url]http://imgur.com/a/XOTDc[/url]

GPU 0 has 16 streams mapped to it and GPU 1 has 6 streams.
Each each does a host-to-device copy per stream(from pinned host memory to device)
a pre-processing kernel per stream
and a batch of large kernels per stream (4 to 10)

GPU 0 is 8% faster than GPU 1 so I give GPU 0 a bit more work.
Have adjusted the number of streams for GPU 1 from 6 to 20, which did not resolve this issue.

Also this is not a situation where I can take advantage of the 2 copy engines, because the output buffer for each GPU is being updated by all streams and has to wait for all updates to be done before doing a cudaMemcpyPeer() from device 1 to device 0 for the final post-processing step.

Since at least one GPU is behaving correctly, I think that I have the code implemented correctly, but maybe I am missing something.

My questions are as follows;

Is there a some upper bound or limit on the number of streams the host can handle?
Can only one host-to-device copy involving pinned memory be performed at one time?
Do need to change any system environment variables such as CUDA_DEVICE_MAX_CONNECTIONS (which is set to 8 by default) ?
Is there any reference code which gives an example of such an application?
Is there any possible issue with the WDDM driver and the fact GPU 1 is connected to the display?

I could not find any examples for a similar multi-GPU sample implementation, so any advice would be appreciated.

little_jimmy · October 29, 2015, 4:50am

truly not easy questions

“1) Is there a some upper bound or limit on the number of streams the host can handle?”

yes, and no.
in my mind, it depends on the method of synchronization the host implements
the guides delineate synchronous and asynchronous memory copies
i think the concept can be extended to synchronization - some synchronization mechanisms seem to be ‘dead’ synchronous, others seem to be more flexible
if the host is to stop minding everything else, and mind/ wait for a single ‘event’, it may cause repercussions, particularly when it is likely to wait long, without having forward-tended to other ‘things’ first

“2) Can only one host-to-device copy involving pinned memory be performed at one time?”

a) if you take the hardware perspective as at least a memory perspective, as well as the pci bus perspective:
from a memory perspective, if there is no overlap, i do not see why not
from a pci bus perspective, it may simply be a case of being able to secure a channel; i am not entirely certain how many unique transfers can be conducted/ managed in the same direction, simultaneously
njuffa designed the memory system and the pci bus, he should know best

“5) Is there any possible issue with the WDDM driver and the fact GPU 1 is connected to the display?”

a thought: if wddm bundles work, and releases bundled work, and if the host implements more of an asynchronous synchronization method, it should have a higher propensity to get work bundled and out the door more quickly, such that it becomes less of a concern

streams are device-local
hence, how do you synchronize between devices, if necessary, and on the host?
you seem to be utilizing cudaDeviceSynchronize(), which raises some concerns

njuffa · October 29, 2015, 6:16am

Not sure whether this is some sort of humorous remark or meant seriously, but for the record: I have not worked on any hardware designs since my AMD Athlon processor days fifteen years ago, and that work was focused on the FPU. My knowledge of PCIe and modern x86 on-chip memory controllers is very limited.

I know that each PCIe point-to-point connection is a full duplex link, so a given device can receive data from and send data to system memory simultaneously. Obviously this also requires that the device has two DMA controllers, one each per transfer direction.

Whether the PCIe controllers of typical x86 systems allow multiple devices to transfer data to / from system memory in the same direction at the same time I do not know. Presumably this requires at minimum no shared system memory address space (on a page boundary, so including false sharing), but even that is just intelligent speculation.

little_jimmy · October 29, 2015, 9:41am

“Whether the PCIe controllers of typical x86 systems allow multiple devices to transfer data to / from system memory in the same direction at the same time I do not know.”

they have buffers, don’t they; would this not then permit multiple devices to transfer at the same time?
also, don’t the pci controllers ‘inject’ the transfers as ordinary memory transactions, in a sense?
would the memory then really care which device is behind the transfer per se?
both the cpu and memory should simply see it as the pci controller wishing to push through a transaction, in my mind

CudaaduC, could you expand device 0 (context 1) (and retake the image)?

CudaaduC · October 29, 2015, 8:40pm

Here are two slightly better images of each GPU process via nvvp.

Again GPU 1 is the top image and GPU 0 is the bottom image;

http://imgur.com/a/sSYZ1

I have tried different numbers of streams per gpu(adjusting the work per stream accordingly), and found that less streams (more work per stream) seems to result in less overlap.

Do not want to give up on this yet, as I just need to reduce the total running time by 37 ms in order to meet my objective.

allanmac · October 29, 2015, 8:56pm

@CudaaduC, are you using cudaMemcpyXXX() routines?

If so, have you considered writing your own “data mover” kernels and running them concurrently?

I found this paper illuminating: Data Transfer Matters for GPU Computing

The relevant plots are marked “IORW”.

CudaaduC · October 30, 2015, 1:18am

@allanmac → Thanks! will try to digest that information and see if it gives me any useful insight.

I am using cudaMemcpyAsync() as you mentioned.

Another interesting fact which I noticed in the nvprof output is that it seems to be splitting the PCI-E pinned bandwidth between the two GPUs;

1.56322s  6.2476ms                    -               -         -         -         -  73.125MB  11.430GB/s  GeForce GTX TIT         2        39  [CUDA memcpy HtoD]
1.56630s  29.948ms                    -               -         -         -         -  73.125MB  2.3845GB/s  GeForce GTX TIT         1        17  [CUDA memcpy HtoD]
1.56632s  617.20us           (959 16 1)       (256 1 1)        16  9.7539KB        0B         -           -  GeForce GTX TIT         1        13  compress_y_half(float const *, __half2*, int) [251]
1.56947s  6.2463ms                    -               -         -         -         -  73.125MB  11.433GB/s  GeForce GTX TIT         2        40  [CUDA memcpy HtoD]
1.57378s  29.935ms                    -               -         -         -         -  73.125MB  2.3855GB/s  GeForce GTX TIT         1        20  [CUDA memcpy HtoD]
1.57379s  615.95us           (959 16 1)       (256 1 1)        16  9.7539KB        0B         -           -  GeForce GTX TIT         1        16  compress_y_half(float const *, __half2*, int) [266]
1.57572s  6.2428ms                    -               -         -         -         -  73.125MB  11.439GB/s  GeForce GTX TIT         2        41  [CUDA memcpy HtoD]
1.57589s  29.919ms                    -               -         -         -         -  73.125MB  2.3868GB/s  GeForce GTX TIT         1        18  [CUDA memcpy HtoD]
1.57590s  616.24us           (959 16 1)       (256 1 1)        16  9.7539KB        0B         -           -  GeForce GTX TIT         1        14  compress_y_half(float const *, __half2*, int) [256]
1.57694s  29.910ms                    -               -         -         -         -  73.125MB  2.3875GB/s  GeForce GTX TIT         1        19  [CUDA memcpy HtoD]
1.57696s  616.75us           (959 16 1)       (256 1 1)        16  9.7539KB        0B         -           -  GeForce GTX TIT         1        15  compress_y_half(float const *, __half2*, int) [261]
1.58196s  6.2474ms                    -               -         -         -         -  73.125MB  11.431GB/s  GeForce GTX TIT         2        42  [CUDA memcpy HtoD]
1.58821s  6.2408ms                    -               -         -         -         -  73.125MB  11.443GB/s  GeForce GTX TIT         2        43  [CUDA memcpy HtoD]
1.59445s  6.2454ms                    -               -         -         -         -  73.125MB  11.434GB/s  GeForce GTX TIT         2        44  [CUDA memcpy HtoD]
1.59625s  29.868ms                    -               -         -         -         -  73.125MB  2.3909GB/s  GeForce GTX TIT         1        21  [CUDA memcpy HtoD]
1.59626s  617.11us           (959 16 1)       (256 1 1)        16  9.7539KB        0B         -           -  GeForce GTX TIT         1        17  compress_y_half(float const *, __half2*, int) [271]

So when it performs the host-device copies from pinned host to GPU 1 it performs them at ~11.4 GBs and they seems to serialize, while the host-device from pinned host to GPU 0 it performs them at ~2.5 GBs and they seem to overlap.

Since the max pinned host-to-device copy speed in practice is usually around 13 GBs, my guess is that that total available bandwidth is being split/shared between the two in a lopsided fashion?

allanmac · October 30, 2015, 1:32am

Based on that nvprof log it looks like you’re already fully utilizing the PCIe bus – i.e. you’re copying a LOT of data from host to device.

For this reason I don’t think my earlier suggestion will help at all. :(

njuffa · October 30, 2015, 2:12am

I am not sure how to interpret the detailed numbers shown above. As a general remark, the overall throughput of a PCIe link (I don’t like the term “bus” as PCIe is a point-to-point interconnect, not a multi-drop bus) increases with the size of individual transfers, up to a size of about 16 MB where it levels off. Certainly a transfer rate of 11.4 GB/sec reflects full PCIe gen3 throughput.

So it may make sense to structure transfers such that the size of individual transfers is maximized. I understand that this may conflict with copy/kernel overlap. It seems like the application is sufficiently optimized that there are multiple bottlenecks in close vicinity of each other.

Since the app is apparently within epsilon of the targeted performance: I am wondering whether power consumption and thermals are optimally controlled to allow maximum clock boosting for the entire duration of the application run.

little_jimmy · October 30, 2015, 5:11am

what is the ave memory transaction (h2d copy) time compared to the kernel execution time?
at first i thought you have ‘short’ kernels, but now i am beginning to think you have lengthy transfer transactions
the profile output supports this, but a real figure would be welcome

from the profile, i get the idea that memory transactions are overly prioritized
it seems certain kernels are essentially delayed by memory transactions; if it is not by your devise, the goal would be to limit this
for device 1, no kernel commences, before all memory transactions are completed, is this essentially required/ desired?
if you consider the point at which the device is kernel-saturated, the same applies to device 0

you issue the work in a loop, not so?
i am beginning to think that you first issue memory transactions, then launch kernels
i generally find that an interleaved approach works better

for (cnt)
{
h2d
}

for (cnt)
{
kernel()
}

vs:

for (cnt)
{
h2d
kernel
}

the other approach is to stream branch
have limited, dedicated transfer streams per direction per device
issue all transfers in these
have the compute streams branch from the transfer streams via events

Topic		Replies	Views
Time intervals and non-concurrent in multi streaming CUDA Programming and Performance cuda	6	577	April 6, 2023
streams vs. direct use of zero copy memory CUDA Programming and Performance	14	13126	March 30, 2011
Slow memcpy performance in dual-CPU, 10 GPU system CUDA Programming and Performance cuda , nsight , gpu	24	2213	January 18, 2023
Multi stream multi GPU CUDA Programming and Performance cuda	9	1105	October 6, 2023
Unpredictable nature of GPU action timing in Nsight CUDA Programming and Performance	10	1203	October 18, 2015
performance problem CUDA Programming and Performance	2	609	July 16, 2018
How to Overlap Data Transfers in CUDA C/C++ Technical Blog	23	2227	January 18, 2023
Transfer data between host and device dynamicly? Maybe it's a problem. CUDA Programming and Performance	12	5264	April 2, 2008
Inconsistent concurrent transfer speed CUDA Programming and Performance	21	1184	April 17, 2023
Multiple GPUs Devise a synchro mechanism for host threads CUDA Programming and Performance	7	4199	May 13, 2010

multi-GPUs with streams. Seems only one device overlapping copies

Related topics