PCI Express Latency and how to decrease it

I am benchmarking transfers of data from pinned host memory to device memory and back.

My program transfers 1MB to 512MB (i.e. 512 separate transfers of 1MB, 2MB, 3MB, … 512MB) of data from the host to the device. Each transfer is timed and repeated 20 times to get an average time for each transfer. This is then repeated transferring data from device to host. These timings are loaded into OpenOffice Calc and are plotted against the size of the transfer. A linear trend line is then added and the formula for it taken. The formula is of the form y = mx + c where y is the time, m is the inverse bandwidth, x is the size and c is the latency of the transfer. An average is then taken for all the transfers to calculate the mean latency and bandwidth. This is similar to the technique V. Volkov uses in his paper LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs.

The PC used for the tests is a Dell XPS 730X (Intel Core i7 965 @ 3.2GHz, 6GB DDR3-1066MHz RAM, 2x nVidia GeForce 285 GTX 1GB connected via PCI Express 2.0 x16 configured for SLI). My results are that host to device transfers have a latency of 35 microseconds (70 microseconds if the GPU has a display attached) and device to host transfers have a latency of 266 microseconds (286 microseconds if the GPU has a display attached). Bandwidth in all cases is 5.7GB/s which is about 70% peak for the PCI Express bus (8GB/s). In contrast V. Volkov gets a latency of 15 microseconds although this was with an older card (nVidia GeForce 8800 GTX 768MB connected via PCI Express 1.1).

Is there any way I can improve my results? I don’t know whether the problem is with the hardware configuration, BIOS settings, software versions or software configuration.

I’m running 64bit Gentoo Linux (2.6.37 kernel) with CUDA 3.2 and nvidia-drivers-260.19.29.

Are you calling cudaThreadSynchronize between the kernel call and the memory transfer? Otherwise the kernel will be async and and the device to host memcpy will include the kernel execution time.

I observed similar results before. This is horrible for multi-GPU programs. Later I updated driver and cuda. The latency is about 11 microseconds now (device to host, display is attached), 6.5 microseconds (host ot device).

Environment: Intel X5570 + Tesla C2050 + Fedora 13(x86_64, + nvidia-drivers-260.19.21 + CUDA_3.2.16 + gcc4.4.5

Using built-in specs.

Target: x86_64-redhat-linux

Configured with: …/configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj- --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux

Thread model: posix

gcc version 4.4.5 20101112 (Red Hat 4.4.5-2) (GCC)

nvcc: NVIDIA ® Cuda compiler driver

Copyright © 2005-2010 NVIDIA Corporation

Built on Wed_Nov__3_16:16:57_PDT_2010

Cuda compilation tools, release 3.2, V0.2.1221


title Fedora (

    root (hd0,5)

    kernel /boot/vmlinuz- ro root=UUID=302e3b99-abd6-4d15-80de-2bfc377b6025 rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=us rhgb quiet nouveau.modeset=0 rdblacklist=nouveau vmalloc=256m

    initrd /boot/initramfs-

I also disabled ECC.

Thanks for the details.

So it is a software issue then? I’m not sure if this machine has ECC memory installed but I doubt it. I’ll try some different driver versions (or maybe start sticking to the mainstream, non-developer versions) and see if that helps.

It should be a software issue, I guess. Driver or/and cuda.

Long latency is unacceptable especially for multi-GPU programs. I think NVIDIA should make more efforts to optimize communication :)

I assume what is being timed here are synchronous cudaMemcpy() calls, using pinned memory on the host. This means we are measuring the full round-trip latency of the cudaMemcpy() call, which obviously exceeds the pure PCIe latency. The performance of host <-> device transfers will naturally depend somewhat on the host. Below I show performance numbers that I generated on an older workstation with PCIe gen2 I have here, running with a Tesla C2050 using the CUDA 3.2 software stack. My timing methodology is similar to STREAM in that I report the fastest of three executions, as this keeps down the measurement noise level for short duration events. I used gettimeofday() for high precision timing. Since this is a workstation, I am running with X and a display attached. The results seem normal to me, nothing suggests a software problem.

Xeon X5272 @ 3.4 GHz, Tesla C2050, CUDA 3.2, RHEL Linux 64-bit

ECC on


^^^^ h2d: bytes=         1  time=     15.97 usec  rate=0.06MB/sec

^^^^ h2d: bytes=         8  time=     15.02 usec  rate=0.53MB/sec

^^^^ h2d: bytes=        64  time=     15.97 usec  rate=4.01MB/sec

^^^^ h2d: bytes=       512  time=     15.02 usec  rate=34.09MB/sec

^^^^ h2d: bytes=      4096  time=     15.97 usec  rate=256.42MB/sec

^^^^ h2d: bytes=     32768  time=     20.03 usec  rate=1636.18MB/sec

^^^^ h2d: bytes=    262144  time=     57.94 usec  rate=4524.74MB/sec

^^^^ h2d: bytes=   2097152  time=    352.86 usec  rate=5943.31MB/sec

^^^^ h2d: bytes=  16777216  time=   2711.06 usec  rate=6188.44MB/sec

^^^^ d2h: bytes=         1  time=     15.02 usec  rate=0.07MB/sec

^^^^ d2h: bytes=         8  time=     15.02 usec  rate=0.53MB/sec

^^^^ d2h: bytes=        64  time=     15.97 usec  rate=4.01MB/sec

^^^^ d2h: bytes=       512  time=     15.97 usec  rate=32.05MB/sec

^^^^ d2h: bytes=      4096  time=     15.97 usec  rate=256.42MB/sec

^^^^ d2h: bytes=     32768  time=     20.03 usec  rate=1636.18MB/sec

^^^^ d2h: bytes=    262144  time=     54.84 usec  rate=4780.49MB/sec

^^^^ d2h: bytes=   2097152  time=    329.97 usec  rate=6355.56MB/sec

^^^^ d2h: bytes=  16777216  time=   2544.88 usec  rate=6592.54MB/sec

ECC off


^^^^ h2d: bytes=         1  time=      6.91 usec  rate=0.14MB/sec

^^^^ h2d: bytes=         8  time=      6.91 usec  rate=1.16MB/sec

^^^^ h2d: bytes=        64  time=      6.91 usec  rate=9.26MB/sec

^^^^ h2d: bytes=       512  time=      6.91 usec  rate=74.05MB/sec

^^^^ h2d: bytes=      4096  time=      6.91 usec  rate=592.41MB/sec

^^^^ h2d: bytes=     32768  time=     11.92 usec  rate=2748.78MB/sec

^^^^ h2d: bytes=    262144  time=     48.88 usec  rate=5363.47MB/sec

^^^^ h2d: bytes=   2097152  time=    343.80 usec  rate=6099.93MB/sec

^^^^ h2d: bytes=  16777216  time=   2705.81 usec  rate=6200.44MB/sec

^^^^ d2h: bytes=         1  time=      6.91 usec  rate=0.14MB/sec

^^^^ d2h: bytes=         8  time=      6.91 usec  rate=1.16MB/sec

^^^^ d2h: bytes=        64  time=      6.91 usec  rate=9.26MB/sec

^^^^ d2h: bytes=       512  time=      6.91 usec  rate=74.05MB/sec

^^^^ d2h: bytes=      4096  time=      6.91 usec  rate=592.41MB/sec

^^^^ d2h: bytes=     32768  time=     10.97 usec  rate=2987.80MB/sec

^^^^ d2h: bytes=    262144  time=     46.01 usec  rate=5696.95MB/sec

^^^^ d2h: bytes=   2097152  time=    321.87 usec  rate=6515.62MB/sec

^^^^ d2h: bytes=  16777216  time=   2533.91 usec  rate=6621.07MB/sec

Yes you are correct. The memory is being allocated on the host with cuMemAllocHost and transferred using cuMemcpyHtoD. I wasn’t sure what to call the values for ‘c’ (in “y = mx + c”) so chose “PCIe latency”. While I’m sure it’s not quite as simple as that (as you point out) it does provide an accurate means to measure the overall (“round-trip”) latency of a host-device transfer. Not shown are the results taken when running the benchmark on my laptop (Thinkpad T61, C2D T7300 @ 2GHz, Quadro NVS 140M). The latency on this was much less (~10 microseconds) although so was the bandwidth (3300MB/s ~70%). As the machine I have been given was sold as a gaming machine set up for SLI I assumed this discrepancy would be down to how the systems had been set up and there would be a BIOS setting or similar I could tweak to improve the latency.

I upgraded the drivers on both systems to 270.18 and it seems to have improved performance confirming this is (also) a software issue. Transfer latency is now less than 10 microseconds on both systems while bandwidth is up to 75% the theoretical peak. Good job nVidia!

Thanks for the feedback. I actually compared multiple driver revisions (including r270) and did not see much of a performance difference with my workstation with Tesla C2050. The improvement you see from switching to r270 drivers may be due to the interaction of a particular host platform with a particular GPU using a particular driver. There is a large number of possible combinations so that it is impossible to track all of them. What I have seen with some older host systems is that the speed of host<->device transfers can be limited by the host’s system memory throughput.