I am benchmarking transfers of data from pinned host memory to device memory and back.
My program transfers 1MB to 512MB of data (i.e. 512 separate transfers of 1MB, 2MB, 3MB, … 512MB) from the host to the device. Each transfer is timed and repeated 20 times to get an average time per size. The whole run is then repeated transferring data from device to host. These timings are loaded into OpenOffice Calc and plotted against the size of the transfer. A linear trend line is then fitted and its formula recorded. The formula is of the form y = mx + c, where y is the time, x is the transfer size, m is the inverse bandwidth and c is the latency of the transfer. Averaging over the runs then gives the mean latency and bandwidth. This is similar to the technique V. Volkov uses in his paper "LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs".
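For reference, here is a minimal sketch of the kind of timing loop I'm describing (the variable names, buffer sizes and the 20-repeat constant here are just illustrative, not my exact benchmark code). It times host-to-device copies from a pinned buffer with CUDA events; the device-to-host run is the same with the copy direction reversed.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const int    repeats = 20;        /* repetitions per transfer size (assumed) */
    const size_t MB      = 1 << 20;

    float *host, *device;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMallocHost((void **)&host, 512 * MB);   /* pinned host buffer */
    cudaMalloc((void **)&device, 512 * MB);

    for (size_t size = 1 * MB; size <= 512 * MB; size += MB) {
        float total = 0.0f;
        for (int i = 0; i < repeats; i++) {
            cudaEventRecord(start, 0);
            cudaMemcpy(device, host, size, cudaMemcpyHostToDevice);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
            float ms;
            cudaEventElapsedTime(&ms, start, stop);
            total += ms;
        }
        /* CSV output: transfer size in bytes, mean time in milliseconds.
         * A linear fit time = m*size + c over these points gives the
         * bandwidth (1/m) and latency (c). */
        printf("%zu,%f\n", size, total / repeats);
    }

    cudaFree(device);
    cudaFreeHost(host);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

The output is what gets imported into OpenOffice Calc for the trend-line fit.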
The PC used for the tests is a Dell XPS 730X (Intel Core i7 965 @ 3.2GHz, 6GB DDR3-1066MHz RAM, 2x nVidia GeForce GTX 285 1GB connected via PCI Express 2.0 x16, configured for SLI). My results are that host-to-device transfers have a latency of 35 microseconds (70 microseconds if the GPU has a display attached) and device-to-host transfers have a latency of 266 microseconds (286 microseconds if the GPU has a display attached). Bandwidth in all cases is 5.7GB/s, which is about 70% of the PCI Express peak of 8GB/s. In contrast, V. Volkov reports a latency of 15 microseconds, although this was with an older card (nVidia GeForce 8800 GTX 768MB connected via PCI Express 1.1).
Is there any way I can improve my results? I don't know whether the problem lies with the hardware configuration, BIOS settings, software versions or software configuration.
I'm running 64-bit Gentoo Linux (kernel 2.6.37) with CUDA 3.2 and nvidia-drivers-260.19.29.