20% of the bandwidth is missing

My GPU is in a PCIe Gen 2.0 x16 slot running at x8. According to Wikipedia, the bandwidth is supposed to be 40 GT/s, or 4 GB/s after accounting for the 8b/10b encoding.

Instead, I get

$ ./bandwidthTest 
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 750 Ti
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     3098.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     3274.1

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     70383.8

So, about 20% of the bandwidth is missing. What could be causing this?

The available bandwidth is not the peak theoretical bandwidth.

The peak theoretical rate for a Gen2 link is 5 GT/s per lane, so a x8 link has a 40 GT/s peak, which would be 5 GB/s. The 8b/10b encoding on PCIe reduces this to 4 GB/s. Finally, the 128-byte MTU (basically, the packet size; PCIe calls it the max payload size) also impacts overall transfer efficiency, with about a further 25% reduction in available bandwidth. That takes the 4 GB/s number down to about 3 GB/s, which is what you are observing. 3 GB/s for a PCIe Gen2 x8 link is approximately “normal”.
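To make the arithmetic above easy to check, here is a minimal Python sketch. The function name and parameters are my own, and the 0.75 payload efficiency is simply the ~75% figure quoted for a 128-byte payload, not something the code derives. The measured numbers are copied from the bandwidthTest output in the question, and I am assuming its MB/s are decimal megabytes.

```python
def pcie_gen2_bandwidth_gb_s(lanes, gt_per_s=5.0, encoding=8 / 10,
                             payload_efficiency=0.75):
    """Return (raw, after 8b/10b, after packet overhead) in GB/s per direction.

    payload_efficiency=0.75 is an assumed figure for a 128-byte payload,
    taken from the discussion above rather than computed here.
    """
    raw = lanes * gt_per_s / 8             # 1 bit per transfer per lane, 8 bits per byte
    encoded = raw * encoding               # 8b/10b: 8 data bits per 10 bits on the wire
    effective = encoded * payload_efficiency
    return raw, encoded, effective


raw, encoded, effective = pcie_gen2_bandwidth_gb_s(lanes=8)
print(f"raw {raw:.1f} GB/s, after 8b/10b {encoded:.1f} GB/s, "
      f"after packet overhead {effective:.1f} GB/s")

# Measured values from the question, as a fraction of the 8b/10b figure.
for label, mb_per_s in (("host to device", 3098.7), ("device to host", 3274.1)):
    print(f"{label}: {mb_per_s / (encoded * 1000):.0%} of {encoded:.1f} GB/s")
```

Both measured directions land between the 3 GB/s estimate and the 4 GB/s post-encoding figure, which is consistent with the roughly 20% gap reported in the question.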

Most of the data points I mention can be found in the Wikipedia article. You can find the effect of MTU on throughput (and the ~75% number corresponding to a 128-byte MTU) here:

http://www.plxtech.com/files/pdf/technical/expresslane/Choosing_PCIe_Packet_Payload_Size.pdf

If this is the cause of the bandwidth reduction, I still don’t understand why a 0.0001 MB packet size would be used when transferring 33 MB?

The max payload size (packet size) is the lower of the max payload size supported by the root complex (i.e. motherboard) and the max payload size supported by the endpoint (i.e. GPU). You can inspect these values directly using lspci on Linux. On the particular Dell workstation (T3500) that I happened to look at, the (root complex) max payload size was not a BIOS-adjustable option (although it may be on some motherboards). Using lspci -vvvx, I could see that the max payload size supported by the root complex was 256 bytes, whereas the max supported by the GPU was 128 bytes, and so 128 bytes was the configured value.
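If you want to pull those values out of the lspci text programmatically, something like the following Python sketch may help. It assumes the usual verbose lspci layout, where the supported value appears as "MaxPayload N bytes" in the DevCap block and the configured value appears in the DevCtl block. The function names are hypothetical, and the two sample strings are hand-written stand-ins using the 256 and 128 byte values mentioned above, not real output from any machine.

```python
import re

LABEL = re.compile(r"^\s*(\w+):")            # e.g. "DevCap:", "DevCtl:", "LnkSta:"
PAYLOAD = re.compile(r"MaxPayload (\d+) bytes")


def parse_max_payload(lspci_text):
    """Return (supported, configured) MaxPayload in bytes for one device."""
    supported = configured = None
    section = None
    for line in lspci_text.splitlines():
        label = LABEL.match(line)
        if label:
            section = label.group(1)         # continuation lines keep the last label
        found = PAYLOAD.search(line)
        if found and section == "DevCap":
            supported = int(found.group(1))
        elif found and section == "DevCtl":
            configured = int(found.group(1))
    return supported, configured


def negotiated_max_payload(root_supported, endpoint_supported):
    """The link uses the lower of the two supported sizes."""
    return min(root_supported, endpoint_supported)


# Hand-written stand-ins for the relevant lspci lines (assumed layout).
ROOT = "\t\tDevCap:\tMaxPayload 256 bytes, PhantFunc 0\n"
GPU = ("\t\tDevCap:\tMaxPayload 128 bytes, PhantFunc 0\n"
       "\t\tDevCtl:\tReport errors: Correctable- Non-Fatal- Fatal- Unsupported-\n"
       "\t\t\tMaxPayload 128 bytes, MaxReadReq 512 bytes\n")

root_supported, _ = parse_max_payload(ROOT)
gpu_supported, gpu_configured = parse_max_payload(GPU)
print(negotiated_max_payload(root_supported, gpu_supported), gpu_configured)
```

On a real system you would feed it the text for a single device, and lspci generally needs to run as root to show the capability blocks.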

The choice of 128 made by the GPU is probably a compromise. If there is a mix of large and small packets, choosing a very large size (like 4096, the max supported by PCIe) would provide a benefit to the large transfers but could otherwise “penalize” short-message PCIe traffic.
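To put rough numbers on the large-transfer side of that trade-off, here is a small Python sketch that counts how many packets the 33554432-byte transfer from the question needs at a few payload sizes. The 24 bytes of per-packet overhead is my own illustrative assumption (framing, sequence number, header and link CRC). The real figure depends on header size and optional ECRC, and this ignores acknowledgement and flow-control traffic, so it should not be expected to reproduce the ~75% number exactly.

```python
def packets_needed(transfer_bytes, payload_bytes):
    """Number of packets needed to carry transfer_bytes (ceiling division)."""
    return -(-transfer_bytes // payload_bytes)


def framing_efficiency(payload_bytes, overhead_bytes=24):
    """Fraction of each packet that is payload.

    overhead_bytes=24 is an assumed, illustrative per-packet cost.
    """
    return payload_bytes / (payload_bytes + overhead_bytes)


TRANSFER = 33554432  # transfer size used by bandwidthTest in the question

for payload in (128, 256, 4096):
    print(f"payload {payload:>4} B: {packets_needed(TRANSFER, payload):>6} packets, "
          f"{framing_efficiency(payload):.1%} payload per packet")
```

So a 33 MB copy is never sent as one unit. It is always chopped into packets of at most the configured payload size, and a larger payload only changes how much of the link is spent on per-packet overhead.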

I see. Thanks!