The most likely cause of this extremely low host/device throughput is that the GPU is plugged into the wrong PCIe slot. It belongs in a PCIe gen3 x16-capable slot, which should yield transfer rates of 12+ GB/s. For comparison, here is bandwidthTest output from a Quadro P2000 in a gen3 x16 slot:
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro P2000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     12327.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     12364.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     119536.8

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
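As a sanity check on the ~12 GB/s figure, the theoretical limit of a PCIe gen3 x16 link can be worked out from the gen3 line rate (8 GT/s per lane) and the 128b/130b encoding. This is a back-of-the-envelope sketch, not part of bandwidthTest itself:

```python
# Theoretical one-direction bandwidth of a PCIe gen3 x16 link.
transfers_per_s = 8e9        # gen3 line rate: 8 GT/s per lane
encoding = 128 / 130         # 128b/130b encoding overhead
lanes = 16
bytes_per_transfer = 1 / 8   # one bit per transfer, converted to bytes

theoretical_gb_s = transfers_per_s * encoding * bytes_per_transfer * lanes / 1e9
print(f"Theoretical gen3 x16 bandwidth: {theoretical_gb_s:.2f} GB/s")  # ~15.75 GB/s
```

Protocol overhead (TLP headers, flow control, etc.) brings achievable host/device throughput down from that ~15.75 GB/s ceiling to roughly 12-13 GB/s, which is what the pinned-memory numbers above show.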
Check your PCIe link configuration by looking at the output of nvidia-smi -q:
    GPU Link Info
        PCIe Generation
            Max                 : 3
            Current             : 3    <--------------
        Link Width
            Max                 : 16x
            Current             : 16x  <--------------
If you look at this output while bandwidthTest (or other CUDA software that performs frequent host/device transfers) is running, the "Current" entries should show generation 3 and link width x16; I took the above snapshot while Folding@Home was running. You can also use third-party software such as TechPowerUp's GPU-Z to monitor the link configuration.
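If you want to check this programmatically rather than by eye, here is a small sketch (not from the original post) that pulls the generation and width fields out of captured nvidia-smi -q text. The sample string below mirrors the output shown above; on a live system you would replace it with a real capture:

```python
import re

# Sample captured from `nvidia-smi -q`. On a real system, replace this with e.g.
# subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True).stdout
sample = """\
GPU Link Info
    PCIe Generation
        Max                 : 3
        Current             : 3
    Link Width
        Max                 : 16x
        Current             : 16x
"""

def pcie_link_status(report: str) -> dict:
    """Extract max/current PCIe generation and link width from nvidia-smi -q text."""
    gen = re.search(
        r"PCIe Generation\s*\n\s*Max\s*:\s*(\d+)\s*\n\s*Current\s*:\s*(\d+)", report)
    width = re.search(
        r"Link Width\s*\n\s*Max\s*:\s*(\d+)x\s*\n\s*Current\s*:\s*(\d+)x", report)
    return {
        "gen_max": int(gen.group(1)), "gen_current": int(gen.group(2)),
        "width_max": int(width.group(1)), "width_current": int(width.group(2)),
    }

status = pcie_link_status(sample)
print(status)
```

nvidia-smi can also report these fields directly in machine-readable form via its query interface, e.g. `nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv`, which avoids parsing the human-readable -q layout.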