PCI Express Latency and how to decrease it

gmacindoe · January 19, 2011, 10:42pm

I am benchmarking transfers of data from pinned host memory to device memory and back.

My program transfers 1MB to 512MB (i.e. 512 separate transfers of 1MB, 2MB, 3MB, … 512MB) of data from the host to the device. Each transfer is timed and repeated 20 times to get an average time for each transfer. This is then repeated transferring data from device to host. These timings are loaded into OpenOffice Calc and are plotted against the size of the transfer. A linear trend line is then added and the formula for it taken. The formula is of the form y = mx + c where y is the time, m is the inverse bandwidth, x is the size and c is the latency of the transfer. An average is then taken for all the transfers to calculate the mean latency and bandwidth. This is similar to the technique V. Volkov uses in his paper LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs.

The PC used for the tests is a Dell XPS 730X (Intel Core i7 965 @ 3.2GHz, 6GB DDR3-1066MHz RAM, 2x nVidia GeForce 285 GTX 1GB connected via PCI Express 2.0 x16 configured for SLI). My results are that host to device transfers have a latency of 35 microseconds (70 microseconds if the GPU has a display attached) and device to host transfers have a latency of 266 microseconds (286 microseconds if the GPU has a display attached). Bandwidth in all cases is 5.7GB/s which is about 70% peak for the PCI Express bus (8GB/s). In contrast V. Volkov gets a latency of 15 microseconds although this was with an older card (nVidia GeForce 8800 GTX 768MB connected via PCI Express 1.1).

Is there any way I can improve my results? I don’t know whether the problem is with the hardware configuration, BIOS settings, software versions or software configuration.

I’m running 64bit Gentoo Linux (2.6.37 kernel) with CUDA 3.2 and nvidia-drivers-260.19.29.

ljbadenz · January 24, 2011, 8:42pm

Are you calling cudaThreadSynchronize between the kernel call and the memory transfer? Otherwise the kernel will be async and and the device to host memcpy will include the kernel execution time.

monramax · January 26, 2011, 9:23pm

I am benchmarking transfers of data from pinned host memory to device memory and back.

My program transfers 1MB to 512MB (i.e. 512 separate transfers of 1MB, 2MB, 3MB, … 512MB) of data from the host to the device. Each transfer is timed and repeated 20 times to get an average time for each transfer. This is then repeated transferring data from device to host. These timings are loaded into OpenOffice Calc and are plotted against the size of the transfer. A linear trend line is then added and the formula for it taken. The formula is of the form y = mx + c where y is the time, m is the inverse bandwidth, x is the size and c is the latency of the transfer. An average is then taken for all the transfers to calculate the mean latency and bandwidth. This is similar to the technique V. Volkov uses in his paper LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs.

The PC used for the tests is a Dell XPS 730X (Intel Core i7 965 @ 3.2GHz, 6GB DDR3-1066MHz RAM, 2x nVidia GeForce 285 GTX 1GB connected via PCI Express 2.0 x16 configured for SLI). My results are that host to device transfers have a latency of 35 microseconds (70 microseconds if the GPU has a display attached) and device to host transfers have a latency of 266 microseconds (286 microseconds if the GPU has a display attached). Bandwidth in all cases is 5.7GB/s which is about 70% peak for the PCI Express bus (8GB/s). In contrast V. Volkov gets a latency of 15 microseconds although this was with an older card (nVidia GeForce 8800 GTX 768MB connected via PCI Express 1.1).

Is there any way I can improve my results? I don’t know whether the problem is with the hardware configuration, BIOS settings, software versions or software configuration.

I’m running 64bit Gentoo Linux (2.6.37 kernel) with CUDA 3.2 and nvidia-drivers-260.19.29.

I observed similar results before. This is horrible for multi-GPU programs. Later I updated driver and cuda. The latency is about 11 microseconds now (device to host, display is attached), 6.5 microseconds (host ot device).

Environment: Intel X5570 + Tesla C2050 + Fedora 13(x86_64, 2.6.34.7-61.fc13.x86_64) + nvidia-drivers-260.19.21 + CUDA_3.2.16 + gcc4.4.5

Using built-in specs.

Target: x86_64-redhat-linux

Configured with: …/configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux

Thread model: posix

gcc version 4.4.5 20101112 (Red Hat 4.4.5-2) (GCC)

nvcc: NVIDIA (R) Cuda compiler driver

Built on Wed_Nov__3_16:16:57_PDT_2010

Cuda compilation tools, release 3.2, V0.2.1221

grub.conf:

title Fedora (2.6.34.7-61.fc13.x86_64)

    root (hd0,5)

    kernel /boot/vmlinuz-2.6.34.7-61.fc13.x86_64 ro root=UUID=302e3b99-abd6-4d15-80de-2bfc377b6025 rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=us rhgb quiet nouveau.modeset=0 rdblacklist=nouveau vmalloc=256m

    initrd /boot/initramfs-2.6.34.7-61.fc13.x86_64.img

I also disabled ECC.

gmacindoe · January 27, 2011, 8:20pm

I observed similar results before. This is horrible for multi-GPU programs. Later I updated driver and cuda. The latency is about 11 microseconds now (device to host, display is attached), 6.5 microseconds (host ot device).

Environment: Intel X5570 + Tesla C2050 + Fedora 13(x86_64, 2.6.34.7-61.fc13.x86_64) + nvidia-drivers-260.19.21 + CUDA_3.2.16 + gcc4.4.5

Using built-in specs.

Target: x86_64-redhat-linux

Configured with: …/configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux

Thread model: posix

gcc version 4.4.5 20101112 (Red Hat 4.4.5-2) (GCC)

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2010 NVIDIA Corporation

Built on Wed_Nov__3_16:16:57_PDT_2010

Cuda compilation tools, release 3.2, V0.2.1221

grub.conf:

title Fedora (2.6.34.7-61.fc13.x86_64)
    root (hd0,5)

    kernel /boot/vmlinuz-2.6.34.7-61.fc13.x86_64 ro root=UUID=302e3b99-abd6-4d15-80de-2bfc377b6025 rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=us rhgb quiet nouveau.modeset=0 rdblacklist=nouveau vmalloc=256m

    initrd /boot/initramfs-2.6.34.7-61.fc13.x86_64.img
I also disabled ECC.

Thanks for the details.

So it is a software issue then? I’m not sure if this machine has ECC memory installed but I doubt it. I’ll try some different driver versions (or maybe start sticking to the mainstream, non-developer versions) and see if that helps.

monramax · January 27, 2011, 9:28pm

It should be a software issue, I guess. Driver or/and cuda.

Long latency is unacceptable especially for multi-GPU programs. I think NVIDIA should make more efforts to optimize communication :)

njuffa · January 29, 2011, 4:37am

I assume what is being timed here are synchronous cudaMemcpy() calls, using pinned memory on the host. This means we are measuring the full round-trip latency of the cudaMemcpy() call, which obviously exceeds the pure PCIe latency. The performance of host <-> device transfers will naturally depend somewhat on the host. Below I show performance numbers that I generated on an older workstation with PCIe gen2 I have here, running with a Tesla C2050 using the CUDA 3.2 software stack. My timing methodology is similar to STREAM in that I report the fastest of three executions, as this keeps down the measurement noise level for short duration events. I used gettimeofday() for high precision timing. Since this is a workstation, I am running with X and a display attached. The results seem normal to me, nothing suggests a software problem.

Xeon X5272 @ 3.4 GHz, Tesla C2050, CUDA 3.2, RHEL Linux 64-bit

ECC on

======

^^^^ h2d: bytes=         1  time=     15.97 usec  rate=0.06MB/sec

^^^^ h2d: bytes=         8  time=     15.02 usec  rate=0.53MB/sec

^^^^ h2d: bytes=        64  time=     15.97 usec  rate=4.01MB/sec

^^^^ h2d: bytes=       512  time=     15.02 usec  rate=34.09MB/sec

^^^^ h2d: bytes=      4096  time=     15.97 usec  rate=256.42MB/sec

^^^^ h2d: bytes=     32768  time=     20.03 usec  rate=1636.18MB/sec

^^^^ h2d: bytes=    262144  time=     57.94 usec  rate=4524.74MB/sec

^^^^ h2d: bytes=   2097152  time=    352.86 usec  rate=5943.31MB/sec

^^^^ h2d: bytes=  16777216  time=   2711.06 usec  rate=6188.44MB/sec

^^^^ d2h: bytes=         1  time=     15.02 usec  rate=0.07MB/sec

^^^^ d2h: bytes=         8  time=     15.02 usec  rate=0.53MB/sec

^^^^ d2h: bytes=        64  time=     15.97 usec  rate=4.01MB/sec

^^^^ d2h: bytes=       512  time=     15.97 usec  rate=32.05MB/sec

^^^^ d2h: bytes=      4096  time=     15.97 usec  rate=256.42MB/sec

^^^^ d2h: bytes=     32768  time=     20.03 usec  rate=1636.18MB/sec

^^^^ d2h: bytes=    262144  time=     54.84 usec  rate=4780.49MB/sec

^^^^ d2h: bytes=   2097152  time=    329.97 usec  rate=6355.56MB/sec

^^^^ d2h: bytes=  16777216  time=   2544.88 usec  rate=6592.54MB/sec

ECC off

=======

^^^^ h2d: bytes=         1  time=      6.91 usec  rate=0.14MB/sec

^^^^ h2d: bytes=         8  time=      6.91 usec  rate=1.16MB/sec

^^^^ h2d: bytes=        64  time=      6.91 usec  rate=9.26MB/sec

^^^^ h2d: bytes=       512  time=      6.91 usec  rate=74.05MB/sec

^^^^ h2d: bytes=      4096  time=      6.91 usec  rate=592.41MB/sec

^^^^ h2d: bytes=     32768  time=     11.92 usec  rate=2748.78MB/sec

^^^^ h2d: bytes=    262144  time=     48.88 usec  rate=5363.47MB/sec

^^^^ h2d: bytes=   2097152  time=    343.80 usec  rate=6099.93MB/sec

^^^^ h2d: bytes=  16777216  time=   2705.81 usec  rate=6200.44MB/sec

^^^^ d2h: bytes=         1  time=      6.91 usec  rate=0.14MB/sec

^^^^ d2h: bytes=         8  time=      6.91 usec  rate=1.16MB/sec

^^^^ d2h: bytes=        64  time=      6.91 usec  rate=9.26MB/sec

^^^^ d2h: bytes=       512  time=      6.91 usec  rate=74.05MB/sec

^^^^ d2h: bytes=      4096  time=      6.91 usec  rate=592.41MB/sec

^^^^ d2h: bytes=     32768  time=     10.97 usec  rate=2987.80MB/sec

^^^^ d2h: bytes=    262144  time=     46.01 usec  rate=5696.95MB/sec

^^^^ d2h: bytes=   2097152  time=    321.87 usec  rate=6515.62MB/sec

^^^^ d2h: bytes=  16777216  time=   2533.91 usec  rate=6621.07MB/sec

gmacindoe · January 31, 2011, 4:43pm

I assume what is being timed here are synchronous cudaMemcpy() calls, using pinned memory on the host. This means we are measuring the full round-trip latency of the cudaMemcpy() call, which obviously exceeds the pure PCIe latency. The performance of host ↔ device transfers will naturally depend somewhat on the host. Below I show performance numbers that I generated on an older workstation with PCIe gen2 I have here, running with a Tesla C2050 using the CUDA 3.2 software stack. My timing methodology is similar to STREAM in that I report the fastest of three executions, as this keeps down the measurement noise level for short duration events. I used gettimeofday() for high precision timing. Since this is a workstation, I am running with X and a display attached. The results seem normal to me, nothing suggests a software problem.

Xeon X5272 @ 3.4 GHz, Tesla C2050, CUDA 3.2, RHEL Linux 64-bit

ECC on

======

^^^^ h2d: bytes=         1  time=     15.97 usec  rate=0.06MB/sec

^^^^ h2d: bytes=         8  time=     15.02 usec  rate=0.53MB/sec

^^^^ h2d: bytes=        64  time=     15.97 usec  rate=4.01MB/sec

^^^^ h2d: bytes=       512  time=     15.02 usec  rate=34.09MB/sec

^^^^ h2d: bytes=      4096  time=     15.97 usec  rate=256.42MB/sec

^^^^ h2d: bytes=     32768  time=     20.03 usec  rate=1636.18MB/sec

^^^^ h2d: bytes=    262144  time=     57.94 usec  rate=4524.74MB/sec

^^^^ h2d: bytes=   2097152  time=    352.86 usec  rate=5943.31MB/sec

^^^^ h2d: bytes=  16777216  time=   2711.06 usec  rate=6188.44MB/sec

^^^^ d2h: bytes=         1  time=     15.02 usec  rate=0.07MB/sec

^^^^ d2h: bytes=         8  time=     15.02 usec  rate=0.53MB/sec

^^^^ d2h: bytes=        64  time=     15.97 usec  rate=4.01MB/sec

^^^^ d2h: bytes=       512  time=     15.97 usec  rate=32.05MB/sec

^^^^ d2h: bytes=      4096  time=     15.97 usec  rate=256.42MB/sec

^^^^ d2h: bytes=     32768  time=     20.03 usec  rate=1636.18MB/sec

^^^^ d2h: bytes=    262144  time=     54.84 usec  rate=4780.49MB/sec

^^^^ d2h: bytes=   2097152  time=    329.97 usec  rate=6355.56MB/sec

^^^^ d2h: bytes=  16777216  time=   2544.88 usec  rate=6592.54MB/sec

ECC off

=======

^^^^ h2d: bytes=         1  time=      6.91 usec  rate=0.14MB/sec

^^^^ h2d: bytes=         8  time=      6.91 usec  rate=1.16MB/sec

^^^^ h2d: bytes=        64  time=      6.91 usec  rate=9.26MB/sec

^^^^ h2d: bytes=       512  time=      6.91 usec  rate=74.05MB/sec

^^^^ h2d: bytes=      4096  time=      6.91 usec  rate=592.41MB/sec

^^^^ h2d: bytes=     32768  time=     11.92 usec  rate=2748.78MB/sec

^^^^ h2d: bytes=    262144  time=     48.88 usec  rate=5363.47MB/sec

^^^^ h2d: bytes=   2097152  time=    343.80 usec  rate=6099.93MB/sec

^^^^ h2d: bytes=  16777216  time=   2705.81 usec  rate=6200.44MB/sec

^^^^ d2h: bytes=         1  time=      6.91 usec  rate=0.14MB/sec

^^^^ d2h: bytes=         8  time=      6.91 usec  rate=1.16MB/sec

^^^^ d2h: bytes=        64  time=      6.91 usec  rate=9.26MB/sec

^^^^ d2h: bytes=       512  time=      6.91 usec  rate=74.05MB/sec

^^^^ d2h: bytes=      4096  time=      6.91 usec  rate=592.41MB/sec

^^^^ d2h: bytes=     32768  time=     10.97 usec  rate=2987.80MB/sec

^^^^ d2h: bytes=    262144  time=     46.01 usec  rate=5696.95MB/sec

^^^^ d2h: bytes=   2097152  time=    321.87 usec  rate=6515.62MB/sec

^^^^ d2h: bytes=  16777216  time=   2533.91 usec  rate=6621.07MB/sec

Yes you are correct. The memory is being allocated on the host with cuMemAllocHost and transferred using cuMemcpyHtoD. I wasn’t sure what to call the values for ‘c’ (in “y = mx + c”) so chose “PCIe latency”. While I’m sure it’s not quite as simple as that (as you point out) it does provide an accurate means to measure the overall (“round-trip”) latency of a host-device transfer. Not shown are the results taken when running the benchmark on my laptop (Thinkpad T61, C2D T7300 @ 2GHz, Quadro NVS 140M). The latency on this was much less (~10 microseconds) although so was the bandwidth (3300MB/s ~70%). As the machine I have been given was sold as a gaming machine set up for SLI I assumed this discrepancy would be down to how the systems had been set up and there would be a BIOS setting or similar I could tweak to improve the latency.

I upgraded the drivers on both systems to 270.18 and it seems to have improved performance confirming this is (also) a software issue. Transfer latency is now less than 10 microseconds on both systems while bandwidth is up to 75% the theoretical peak. Good job nVidia!

njuffa · January 31, 2011, 8:19pm

Thanks for the feedback. I actually compared multiple driver revisions (including r270) and did not see much of a performance difference with my workstation with Tesla C2050. The improvement you see from switching to r270 drivers may be due to the interaction of a particular host platform with a particular GPU using a particular driver. There is a large number of possible combinations so that it is impossible to track all of them. What I have seen with some older host systems is that the speed of host<->device transfers can be limited by the host’s system memory throughput.

Topic		Replies	Views
Inconsistent concurrent transfer speed CUDA Programming and Performance	21	1237	April 17, 2023
PCI Express x16 bandwidth - host<->device transfer Bandwidth is much lower than should be CUDA Programming and Performance	38	68079	April 18, 2008
Question about PCI-E transfer throughput CUDA Programming and Performance	13	152	April 5, 2025
Memory bandwidth CUDA Programming and Performance	31	38430	October 5, 2007
concurrent D2H+H2D transfers? CUDA Programming and Performance	5	2489	May 10, 2016
The change of speed when copying data between host and device CUDA Programming and Performance pcie , cuda , linux	5	1956	October 12, 2021
Poor Memcpy Performance Copying To Pinned Memory On Host CUDA Programming and Performance	16	8021	April 2, 2014
Memory problem? ...incredible slowdown CUDA Programming and Performance	29	16338	January 30, 2011
Slow memcpy performance in dual-CPU, 10 GPU system CUDA Programming and Performance cuda , nsight , gpu	24	2295	January 18, 2023
How to Optimize Data Transfers in CUDA C/C++ Technical Blog	12	1210	January 22, 2022

PCI Express Latency and how to decrease it

Related topics