PCI Express x16 bandwidth - host<->device transfer bandwidth is much lower than it should be

I’m running on an ASUS P5KC motherboard with a quad-core 2.4GHz Q6600 and an NVIDIA 8600GTS. I’m seeing a peak of about 730MB/s transfer rate, both to and from the device. The motherboard is definitely PCI Express x16 and so is the video card (I don’t think the 8600GTS comes in anything BUT PCI Express x16).

PCIe x16 should transfer at around 4GB/s. What I’m seeing is closer to PCI Express x1. I’m wondering what other people are seeing for transfer rates, and whether there is anything specific I should do to boost the speed. I googled around a bit and looked in my BIOS; it looks like it should “just work”.

I’m seeing the ~700MB/s transfer speed both in the “bandwidthTest” CUDA sample and in my own CUDA programs. They are terribly bandwidth bound right now, so I would really like to get to the bottom of this. Does anyone know of a non-CUDA application that can test GPU transfer speed? I wonder whether this problem is specific to CUDA or not.

CUDA should give you faster transfer rates than OpenGL or DirectX.

700MiB/s is only a little lower than I get with normal pageable memory on my box: 1074MB/s host to device and 909.9MB/s device to host.

You can get much faster transfer rates with pinned memory allocated via cudaMallocHost. The bandwidthTest benchmark will use this memory if you give it the command line option --memory=pinned. Edit: forgot to mention that my box gets 2.5GiB/s with pinned memory in Linux and 3GiB/s in Windows.
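If you want to try this outside of the SDK sample, here is a minimal sketch (the helper name, the 32MB transfer size, and the iteration count are my own choices, not anything from bandwidthTest): it times the same host-to-device copy once from ordinary malloc’d memory and once from memory allocated with cudaMallocHost.

// Minimal sketch: compare pageable vs. pinned host-to-device copy bandwidth.
// copyBandwidthMB, the 32MB size and the 10 iterations are arbitrary choices.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

static float copyBandwidthMB(void *host, void *dev, size_t bytes, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    // total bytes moved / elapsed seconds, reported in MB/s
    return (float)bytes * iters / (ms / 1000.0f) / 1.0e6f;
}

int main(void)
{
    const size_t bytes = 32 * 1024 * 1024;  // same 32MB size that bandwidthTest reports
    const int    iters = 10;

    void *dev, *pageable, *pinned;
    cudaMalloc(&dev, bytes);
    pageable = malloc(bytes);
    cudaMallocHost(&pinned, bytes);         // page-locked ("pinned") host memory

    printf("pageable: %.1f MB/s\n", copyBandwidthMB(pageable, dev, bytes, iters));
    printf("pinned:   %.1f MB/s\n", copyBandwidthMB(pinned, dev, bytes, iters));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}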

Thanks, this sounds very promising! I don’t remember reading about pinned memory in the CUDA user guide, but I will go back and study up on it.

While we’re on the subject of memory, it wasn’t entirely clear to me what happens with shared memory if you’re not explicitly allocating a shared buffer. If you just have a simple program that, say, adds two buffers together, will that data end up in shared memory for the addition, and then, when the warp finishes, automatically be synchronized with global memory again? Or do you have to explicitly allocate a shared memory array and copy out of global memory into it? I guess it doesn’t really matter if you’re only touching the data once, but I’m still curious what happens internally. The documentation doesn’t seem to cover this case.

Shared memory is only used when explicitly declared shared by the user. In your case of adding two buffers, the temporaries would be stored in registers. Registers are used for any variable in the kernel declared “normally”, i.e. “int a”, “float4 b”. The only exception is if you declare an array and access it with a variable index; that is put into slow local device memory, as registers cannot be indexed.
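To make that concrete, here is a rough sketch (the kernel names and the block size of 256 are illustrative, not from any sample): the first kernel keeps its temporary in a register and never touches shared memory; the second stages the inputs through an explicitly declared __shared__ array, which only pays off when the threads of a block need to reuse each other’s data.

// Plain version: 'sum' lives in a register; global memory is read and
// written directly, and shared memory is never involved.
__global__ void addBuffers(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float sum = a[i] + b[i];   // register
        c[i] = sum;
    }
}

// Explicit shared-memory version: data must be staged in by hand and the
// block synchronized; for a simple element-wise add this buys you nothing.
__global__ void addBuffersShared(const float *a, const float *b, float *c, int n)
{
    __shared__ float tileA[256];   // assumes a block size of 256 threads
    __shared__ float tileB[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tileA[threadIdx.x] = a[i];
        tileB[threadIdx.x] = b[i];
    }
    __syncthreads();               // make the staged data visible to the whole block
    if (i < n)
        c[i] = tileA[threadIdx.x] + tileB[threadIdx.x];
}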

Unfortunately pinned memory seems to be performing at almost the same rate as pageable. I even stepped through the code, and cudaMallocHost is definitely being called. The other question: why is my pageable memory so slow if you were seeing 1000+MB/s? Here are the results from bandwidthTest.

Quick Mode
Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                653.9

Quick Mode
Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                815.5

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                19008.1

And pageable:

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                646.0

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                804.6

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                18976.7

Did you use the command line option --memory=pinned?

Given that the quoted output says “Device to Host Bandwidth for Pinned memory”, it was set.

I’m a bit puzzled by the fact that pinned memory didn’t increase your performance at all. Perhaps NVIDIA can comment. Is the 8600 slower at memory transfers than the 8800 despite the PCIe x16 interface?

I also just looked up your motherboard at Newegg. It has two PCIe slots (one x4 and the other x16). Do you have two cards in? Which slot is your 8600 in?

Hmm, I also looked at the manual a few days ago; it looked like there was an x1 and two x16. I didn’t notice the x4 at all. I only have one card and it is in the lower of the two slots - it could very well be the x4. I will just move it up to the next slot and see what happens. I’ll let you know. Thanks for taking a look!

I was just going off of the specifications listed at Newegg. Obviously the manual is the authoritative source. Still, it doesn’t hurt to try moving the card up a slot.

Voila, that was the problem! I didn’t realize that an x16 card could fit into an x4 slot. I thought they were all physically different sizes.

Now I’m getting around 2GB/s transfer speed, both in my own app and in the bandwidth test. This is still a far cry from the theoretical 4.0GB/s that PCIe x16 should achieve. It sounds like everyone else is running at about this same speed (between 2 and 2.5GB/s), though. Is there an explanation for why?

The achieved rate depends on various aspects of your motherboard, chipset, memory system, etc. You’ll never get the theoretical rate because the PCIe bus has some overhead, and you lose some of the bandwidth to packet headers etc., IIRC. I have a test machine that uses NVIDIA LinkBoost that hits 3900MB/sec (LinkBoost overclocks the PCIe bus…), so there are machines that go fast. PCIe 2.0 should be faster yet when good motherboards come out.
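As a rough back-of-the-envelope (these are generic PCIe 1.x figures, not measurements from any particular board): each lane signals at 2.5GT/s, and after 8b/10b encoding that leaves 250MB/s of payload per lane per direction, which is where the 4GB/s figure for x16 comes from. The data actually moves in packets that carry roughly 20 bytes of header and framing around a payload commonly capped at 128 or 256 bytes, so packet overhead alone pulls the ceiling down to roughly 3.4-3.7GB/s before the chipset and host memory system take their share.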

John Stone

I agree that PCIe x16 will never run at 4GB/s in practice due to various overhead, but overhead alone cannot account for a 50%+ loss in bandwidth.

I’m getting 2.5GB/s host to device and 1.9GB/s device to host. I have a brand new high end motherboard; all components are basically top of the line.

I guess it’s time to bring up another question. Why is host->device so much faster than device->host? The PCIe bus is bidirectional, with equal bandwidth in each direction. This sounds almost like a software issue. Is this possibly the video driver doing something funky, like breaking the transfer into many small packets and losing bandwidth to overhead that way?

I’m starting to wonder how many people are running x16 motherboards but only seeing x8 speeds. If anyone else is reading this, can you post your specs and what you get from the bandwidth test? Be sure to run with the “--memory=pinned” parameter.
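For reference, the invocation is just the SDK sample binary (the exact path depends on where your SDK built it):

./bandwidthTest --memory=pinned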

2.5-3.0GB/s is really the maximum you will see with PCIe x16, on any hardware. I’m quite sure that’s a PCI Express issue and has nothing to do with the software. The only thing the software does for a pinned memory copy is issue a DMA request to the GPU (since the pinned host memory is page-locked and already set up for the GPU to DMA from, that’s all that has to be done).
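To illustrate the “it’s just a DMA” point, here is a rough sketch of what that looks like from the API side (this assumes a CUDA release with streams and async copies; the buffer size is arbitrary): the async copy only queues the DMA and returns, and it is only allowed to be asynchronous because the host buffer is page-locked.

// Sketch: with pinned host memory the copy is a DMA that the GPU's copy
// engine performs on its own; cudaMemcpyAsync just queues the request.
float *h_buf, *d_buf;
size_t bytes = 32 * 1024 * 1024;

cudaMallocHost((void **)&h_buf, bytes);  // page-locked, so the DMA engine can read it directly
cudaMalloc((void **)&d_buf, bytes);

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
// ... the CPU is free to do other work here while the DMA is in flight ...
cudaStreamSynchronize(stream);           // wait for the transfer to complete

cudaStreamDestroy(stream);
cudaFree(d_buf);
cudaFreeHost(h_buf);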

That’s not entirely true, though. tachyon said he is getting 3900MB/s with LinkBoost. LinkBoost increases the bus frequency (and therefore bandwidth) by 25%, which means that without it he should theoretically be getting 3120MB/s. That’s a LOT higher than the 1900MB/s I’m seeing.

I’m surprised how everyone is just writing this off as a “PCI-E just isn’t that fast” issue. If it was slow across the board, fine. But we are seeing massive discrepancies in transfer speed on high end hardware; clearly, there is some component (whether software or hardware) that is causing PCI-E to either be fast or not. The NVIDIA tech person I spoke with is getting well over 3GB/s and he’s not using LinkBoost, AFAIK. So that means some component can make a difference of +/- 1GB/s. That’s a massive, massive difference in bandwidth.

I still have this issue open with NVIDIA, and he is currently reproducing my exact hardware config to see if he gets the same throughput.

Does anyone know of another tool that can profile PCIe bandwidth? I searched for a long time last night and could not find anything. In fact, I found nothing at all talking about real world PCIe bandwidth; everyone just assumes it’s 4GB/s. I did run CPU-Z and it verified that my PCIe is running at x16.

You can get 3.2GB/s with pinned memory (I have seen that number on several motherboards).

The actual number will depend on hardware details of the motherboard: for example, PCIe slots coming off the southbridge instead of the northbridge, problems related to non-flat topologies on HT-based systems, etc.

There is no way of predicting the PCI-e bandwidth without running a test.

Thanks, now we are getting somewhere! Is there any resource where you can look up motherboards and measured PCIe speeds? (Or other rules to apply, such as whether or not the slot is on the northbridge?)

I would assume NVIDIA nForce hardware is going to be the best - especially with that PCIe overclock feature?

Also, do you know of any software that can measure PCIe bandwidth besides the bandwidth test (which requires the CUDA driver, so it is not really that accessible to the general public)?

Just curious. Which ones?

We have a few Sun machines that have normal PCIe x16 (not LinkBoost overclocked) that do quite well.

Sun Ultra 20 (2.2GHz Opteron 148) w/ GeForce 8800GTX:

Non-pinned host to GPU: 1131.29MB/sec
Pinned host to GPU:     3193.09MB/sec
Non-pinned GPU to host: 1085.20MB/sec
Pinned GPU to host:     3077.97MB/sec

Sun Ultra 40 (2x 2.4GHz Opteron 280) w/ Quadro FX 5600:

Non-pinned host to GPU: 1382.96MB/sec
Pinned host to GPU:     2283.26MB/sec
Non-pinned GPU to host: 1371.61MB/sec
Pinned GPU to host:     2966.28MB/sec

NCSA has some HP boxes with QuadroPlexes attached that are also close to 3200MB/sec.

Cheers,

John Stone

Hi!
After a few test runs I found that my bandwidth test numbers are very bad. Where is the error?

Here are my bandwidth results from the test: [file]

My machine:

Processor
Model: Intel® Pentium® 4 CPU 3.20GHz
Speed: 3.19GHz
Cores per processor: 1
Threads per core: 2
Internal data cache: 16kB, synchronous, write-thru, 8-way set associative, 64-byte line size, 2 lines per sector
L2 onboard cache: 1MB, ECC, synchronous, ATC, 8-way set associative, 64-byte line size, 2 lines per sector

System
System: Dell Inc. OptiPlex GX280
Mainboard: Dell Inc. 0G5611
Bus(es): X-Bus PCI PCIe IMB USB FireWire/1394 i2c/SMBus
MP support: 1 processor(s)
MP APIC: Yes
Total memory: 1022.57MB DDR2

Chipset 1
Model: Dell 82915G/GV/GL/P/PL/GL/910GE/GL Grantsdale Host Bridge/DRAM Controller
Front side bus speed: 4x 200MHz (800MHz)
Total memory: 1GB DDR2
Memory bus speed: 4x 100MHz (400MHz)

Graphics system
Adapter: NVIDIA GeForce 8800 GTX
results.txt (4.94 KB)

System: Dell Precision 490
Processor: Intel Quad Xeon 5355 @ 2.6GHz
Memory: 2GB RAM
GPU: 8800 GTS 320MB @ PCIe x16
OS: Fedora Core 6 - 2.6.22.7-57.fc6

Pageable:

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                1543.3

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                1238.3

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                47558.9

&&&& Test PASSED

Pinned:

Quick Mode
Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                2517.9

Quick Mode
Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                3022.1

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                47573.0

&&&& Test PASSED