Transfer rates for mapped memory: is the driver involved?

Hi,
I’m trying to figure out what the bottleneck is for the transfer rate from/to two GTX480 cards.
I am using an X58 chipset that supports two x16 PCIe slots. The QPI has a limit of 9155 MiB/sec.
Using cudaMemcpy the max transfer rates are:
1 GPU: H2D 5729 MiB/sec D2H 6173 MiB/sec
2 GPUs: H2D 8225 MiB/sec D2H 6175 MiB/sec
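
For reference, the measurement is along these lines (a simplified sketch, not my exact benchmark; buffer size and repetition count are arbitrary):

// Minimal sketch of the H2D measurement with pinned host memory and
// cudaEvent timing (sizes and repetition count are arbitrary).
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 256u << 20;   // 256 MiB per copy
    const int    reps  = 20;

    void *h_buf = 0, *d_buf = 0;
    cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault);   // pinned
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.0f MiB/sec\n",
           (double)bytes * reps / (1024.0 * 1024.0) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}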

I wrote a kernel that reads data from write-combined mapped memory.
The max transfer rates were:
2 GPUs: H2D 8211 MiB/sec D2H 6161 MiB/sec
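
In outline, the zero-copy path looks like this (a simplified sketch, not my exact code; names and the launch configuration are placeholders):

// Sketch of the zero-copy H2D path: the kernel reads mapped,
// write-combined host memory directly and stores it into device memory.
#include <cuda_runtime.h>

__global__ void pull_from_host(const float *src_mapped, float *dst_dev,
                               size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n; i += stride)
        dst_dev[i] = src_mapped[i];   // each read travels over PCIe
}

int main(void)
{
    const size_t n = 64u << 20;             // 64M floats = 256 MiB
    float *h_src = 0, *d_src = 0, *d_dst = 0;

    cudaSetDeviceFlags(cudaDeviceMapHost);  // must precede context creation
    cudaHostAlloc((void **)&h_src, n * sizeof(float),
                  cudaHostAllocMapped | cudaHostAllocWriteCombined);
    cudaHostGetDevicePointer((void **)&d_src, h_src, 0);
    cudaMalloc((void **)&d_dst, n * sizeof(float));

    pull_from_host<<<30, 256>>>(d_src, d_dst, n);   // launch config is a placeholder
    cudaDeviceSynchronize();

    cudaFree(d_dst);
    cudaFreeHost(h_src);
    return 0;
}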

Is the driver involved in the mapped-memory transfers, or is it purely hardware DMA?
Why am I not getting more bandwidth for D2H transfers?

Thanks

The data rates you show for the one-GPU case indicate your PCIe throughput is at the upper end of the range I have seen for PCIe gen2 x16 across various chipsets. Note that PCIe uses packetized transport, so the payload bandwidth is not the same as the raw bandwidth. A useful description of the PCIe efficiency at various packet sizes can be found here:
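
As a rough back-of-the-envelope example (assuming a maximum payload size of 128 bytes, which is a common configuration): a gen2 x16 link delivers 16 lanes * 5 GT/s * 8/10 (8b/10b encoding) = 8000 MB/sec of raw bandwidth per direction. With roughly 20-24 bytes of header, CRC, and framing overhead per 128-byte packet, only about 84-86% of that is payload, and additional overhead for flow control and read completions pushes the practically achievable rate down into the 5500-6500 MB/sec range, which is where your numbers fall.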

By “mapped memory” do you mean “pinned memory”? Transfers from/to pinned memory via cudaMemcpyAsync() are pure DMA transfers, but obviously the driver has to set up the DMA transfers. There are no additional copies through a pinned staging buffer in host memory, as there typically are when non-pinned memory is used.
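
To illustrate the difference (a sketch; buffer names and sizes are arbitrary, not taken from your benchmark):

// Sketch: compares a pinned with a pageable source buffer for
// cudaMemcpyAsync(). With the pinned buffer the driver can program a
// single DMA transfer; with the pageable buffer it stages the data
// through an internal pinned buffer and the call is not fully
// asynchronous with respect to the host thread.
#include <cstdlib>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64u << 20;
    float *h_pinned = 0, *h_pageable = 0, *d_buf = 0;

    cudaMallocHost((void **)&h_pinned, bytes);   // page-locked, DMA-able
    h_pageable = (float *)malloc(bytes);         // pageable
    cudaMalloc((void **)&d_buf, bytes);

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Pure DMA transfer; the call returns immediately.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, s);

    // Staged through a driver-owned pinned buffer.
    cudaMemcpyAsync(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice, s);

    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}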

I assume your expectation was that simultaneous transfers from both cards through both PCIe ports combined should be able to saturate the QPI link, and that doesn’t seem to happen. I assume you have verified that the transfers from both cards do indeed occur simultaneously. I am not familiar with the X58 chipset. I do not know how it multiplexes transfers from the dual PCIe x16 ports onto QPI. Are there configuration options for that which you could try (maybe in the SBIOS)?
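
One way to check would be something like the sketch below (buffer sizes are placeholders; single-threaded multi-GPU handling via cudaSetDevice() needs CUDA 4.0 or later, with older toolkits one host thread per GPU would be required instead):

// Sketch: issue D2H copies on both GPUs back to back from one host
// thread, then wait on both streams.
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 256u << 20;
    float *h_dst[2], *d_src[2];
    cudaStream_t stream[2];

    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaMallocHost((void **)&h_dst[dev], bytes);  // pinned, so the copy is truly async
        cudaMalloc((void **)&d_src[dev], bytes);
        cudaStreamCreate(&stream[dev]);
    }

    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaMemcpyAsync(h_dst[dev], d_src[dev], bytes,
                        cudaMemcpyDeviceToHost, stream[dev]);
    }
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaStreamSynchronize(stream[dev]);
    }
    // If timing the two loops above gives roughly the single-GPU copy
    // time, the transfers overlap; roughly twice that time means they
    // are being serialized somewhere (driver, chipset, or QPI arbitration).
    return 0;
}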

Thanks for the reply. By mapped memory I mean that the kernel reads directly from host memory that is ‘mapped’ (and pinned). Since the DMA engine is on the card, the driver may not be involved at all.

I ran with 10-30 thread blocks (same performance) and passed a different allocated memory chunk to each block to read/write. I get the maximum performance with large chunks.

I didn’t modify any BIOS options; I didn’t find a relevant one. The IOMMU is off. I verified that the transfers are simultaneous.

What’s bothering me is the low D2H transfer rate for 2 GPUs; it looks as if these transfers are serialized.

Your description (kernel accesses host memory directly) indicates you are testing a zero-copy scenario. Sorry, I have no experience with that at all, and don’t have any insights into the underlying mechanism or know what kind of performance one should expect. I am aware that to maximize device/device bandwidth on Fermi-class GPUs it is best to run many thread blocks (at least several hundred, better thousands); whether that recommendation extends to device/host bandwidth in zero-copy situations, I do not know.

Am I misreading something, or does zero-copy not play any role in this? As far as I read the numbers in the initial post, single-GPU DMA, dual-GPU DMA, and zero-copy all reach exactly the same DtoH bandwidth, which seems to be at the upper limit of reported PCIe GPU bandwidths (as Norbert pointed out already). I cannot precisely pinpoint where the limit stems from, or why it is a bit smaller than the accumulated multi-GPU HtoD bandwidth, but I’d say that until the advent of PCIe3 GPUs, mainboards, and suitably dimensioned chipsets/CPUs, this is the maximum one can expect.

So, back to my question: can it be a software (driver) problem, or is it purely hardware?
And tera, if ~6.2 GB/s is the PCIe2 x16 limit, shouldn’t the bottleneck for two PCIe2 x16 links be somewhere else, and higher?

It might e.g. be in the chipset, yes.