BIOS or Hardware settings to improve pageable memory bandwidth

CudaaduC · June 17, 2014, 4:33am

Here is a screenshot of the current bandwidth situation:

So there is a big difference between the ‘pinned’ and pageable memory bandwidth.

Is there any BIOS setting or other environment variable which can skew the ratio more in favor of pageable memory?

ASUS X79 Deluxe motherboard, Windows 7 Ultimate

njuffa · June 18, 2014, 10:17pm

I am assuming you are performing these experiments on a machine that is otherwise idle, i.e. one can be reasonably sure that neither the test application task nor the system memory pages it is touching are swapped out at any time during the test.

The numbers for transfers from “pageable memory” look a bit low. The performance of those transfers is heavily dependent on the performance of the system memory, as they involve a copy from pageable system memory into a pinned system memory buffer that is set up by the driver, followed by a DMA transfer from the pinned system memory buffer to the GPU (or vice versa for transfers from device to host).

Compared to transfers from “pinned memory”, this introduces two issues: (1) system memory bandwidth requirements are doubled, and (2) the pinned buffer set up by the driver has a limited size, potentially requiring multiple transfers. In general, I have found that the second issue has only minor impact, and that the overall transfer rate from/to pageable memory is, to first order, determined by system memory throughput.
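As a rough illustration of why the staging copy matters, here is a toy model of the two-stage path described above. It is only a sketch: the function name and both input rates are made up for illustration, and how much the driver overlaps the two stages is not stated in this thread, so the model simply brackets the result between "no overlap" and "perfect overlap".

```python
def staged_transfer_bounds(host_copy_gbs, dma_gbs):
    """Return (serialized, overlapped) throughput estimates in GB/s.

    host_copy_gbs: rate of the CPU copy from pageable memory into the
                   driver's pinned staging buffer (assumed input).
    dma_gbs:       rate of the DMA transfer over PCIe (assumed input).
    """
    if host_copy_gbs <= 0 or dma_gbs <= 0:
        raise ValueError("rates must be positive")
    # No overlap: every byte pays for both stages back to back.
    serialized = 1.0 / (1.0 / host_copy_gbs + 1.0 / dma_gbs)
    # Perfect overlap: the slower stage sets the pace.
    overlapped = min(host_copy_gbs, dma_gbs)
    return serialized, overlapped


# Hypothetical rates, not measurements from this thread.
low, high = staged_transfer_bounds(host_copy_gbs=8.0, dma_gbs=6.0)
print(f"expected pageable throughput between {low:.2f} and {high:.2f} GB/s")
```

In either bound, a slower host-side copy pulls the pageable number down while the pinned number stays put, which is consistent with system memory throughput being the first-order factor.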

I would suggest starting out by measuring the system memory throughput, for example with STREAM, to see whether the results are in the expected range. The expected memory throughput will vary with the specifics of your DDR3 configuration, such as the speed grade of the DIMMs and whether the DIMMs are unbuffered, registered, or dual-ranked; but to an order of magnitude you should see a total bandwidth of about 25 GB/sec for two-channel configurations (e.g. workstation) and 50 GB/sec for four-channel configurations (typically used for servers) on modern hardware (IvyBridge, Haswell).
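The 25 and 50 GB/sec figures follow from simple arithmetic: each DDR3 channel is 64 bits (8 bytes) wide, so peak bandwidth is transfers per second times 8 bytes times the number of channels. The sketch below works this out assuming DDR3-1600 (the speed grade is an assumption, not something stated in this thread) and adds a crude single-threaded copy test. The copy test is not STREAM and will usually read lower than a proper multi-threaded STREAM run; it is only a quick sanity check.

```python
import time


def ddr_peak_gbs(mega_transfers_per_s, channels, bus_bytes=8):
    """Theoretical peak in GB/s: transfers/s x bytes per transfer x channels."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


def measure_copy_gbs(size_bytes=64 * 1024 * 1024, repeats=5):
    """Best-of-N rate of a large buffer copy, counting bytes read plus written."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment copies into dst in place
        best = min(best, time.perf_counter() - start)
    return 2 * size_bytes / best / 1e9


# DDR3-1600 is an assumed speed grade; substitute the one actually installed.
print(f"2 channels: {ddr_peak_gbs(1600, channels=2):.1f} GB/s peak")
print(f"4 channels: {ddr_peak_gbs(1600, channels=4):.1f} GB/s peak")
print(f"single-threaded copy: {measure_copy_gbs():.1f} GB/s")
```

With DDR3-1600 the peaks come out at about 25.6 and 51.2 GB/sec, which matches the order-of-magnitude numbers above. Measured throughput always lands somewhat below the theoretical peak.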

Special care needs to be applied when such measurements are performed on platforms with multiple CPU sockets, in which case memory access becomes non-uniform as each CPU has its own memory controller. One would want to control CPU affinity and memory affinity such that the GPU communicates with the “near” CPU and the “near” memory. Under Linux one typically uses numactl to do this; I do not know how it works under Windows.
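For Linux only, the "near CPU" part can also be done from inside a test script. The sketch below is an assumption-laden illustration: it reads the NUMA node of a PCI device from sysfs and restricts the calling process to that node's CPUs. The helper names and the PCI address in the example call are hypothetical. Note that this sets CPU affinity only; it does not enforce memory placement the way numactl's memory binding does, although buffers allocated after pinning will normally end up on the local node under the default allocation policy.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-7,16-23' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_device_node(pci_address):
    """Restrict this process to the CPUs of the NUMA node nearest a PCI device.

    Returns the node number, or None if the kernel reports no NUMA node.
    """
    with open(f"/sys/bus/pci/devices/{pci_address}/numa_node") as f:
        node = int(f.read())
    if node < 0:  # -1 means no NUMA affinity is reported for this device
        return None
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 = the calling process
    return node


# Example call; the PCI address is hypothetical, look up the real one first:
# pin_to_device_node("0000:02:00.0")
```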

[Later:]

I am not sure what your experience with CUDA-Z is, but I just downloaded it from SourceForge and see significant discrepancies between measurements from my own test app vs. the ones from CUDA-Z. I have a Windows 7 system here with a Tesla C2050, so PCIe2. The CPU is an Intel i7-3820 @ 3.60 GHz (that is IvyBridge, I think), with two DDR3 channels. My measurements (using a block size of 16 MB) show for transfers from / to pageable memory:

host->device 3496 MB/sec
device->host 4371 MB/sec

Measured throughput seems to fluctuate by about 5%. The CUDA-Z 0.8.207 utility shows for transfers from / to pageable memory:

host->device 2295 MB/sec
device->host 3929 MB/sec

In addition, measured throughput fluctuates wildly, and I see values as small as 1900 / 2500 MB/sec. The lower values and massive fluctuation could be an indication that CUDA-Z uses a smaller block size. Without additional analysis I would tend to trust results from my own app more :-)
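To show why a smaller block size would produce lower numbers, here is a simple model in which every transfer pays a fixed overhead on top of the time needed to move the bytes. Both parameters below are hypothetical and chosen only to show the shape of the curve; they are not measurements of CUDA-Z or of any driver.

```python
def effective_mbs(block_bytes, peak_mbs, overhead_us):
    """Throughput in MB/s when each transfer pays a fixed overhead plus size/peak."""
    transfer_s = block_bytes / (peak_mbs * 1e6)
    return block_bytes / (overhead_us * 1e-6 + transfer_s) / 1e6


# Hypothetical parameters, for illustration only.
PEAK_MBS = 4000.0
OVERHEAD_US = 20.0

for kib in (64, 1024, 16 * 1024):
    rate = effective_mbs(kib * 1024, PEAK_MBS, OVERHEAD_US)
    print(f"{kib:6d} KiB blocks: {rate:7.0f} MB/s")
```

In this model small blocks are dominated by the fixed cost, so measured throughput rises toward the peak as the block size grows. It also makes small-block measurements more sensitive to any jitter in the per-transfer overhead.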

njuffa · June 21, 2014, 8:23pm

If this is your motherboard [url]http://www.asus.com/Motherboards/P9X79_DELUXE/[/url] it appears to support quad-channel system memory, which is unusual for a single-socket system. This online review of the board shows measured system memory bandwidth close to 50 GB/sec:

[url]http://www.hardocp.com/article/2013/09/18/asus_x79_deluxe_lga_2011_motherboard_review[/url]

CudaaduC · June 22, 2014, 3:01am

Yes, that appears to be the same MOBO. I just need some trial-and-error to optimize the system, since I built it myself.

I just use CUDA-Z to get a general idea of the operations of the device(s). Those numbers do vary from the similar tests in the CUDA SDK, or my own tests.

For example, for the K20c, CUDA-Z shows the DP FMA at 1166 Gflops (99% of theoretical peak), while the SP FMA via CUDA-Z shows up at 2057 Gflops (~59% of theoretical peak). But in my own tests the SP numbers are much better than shown in CUDA-Z, so I do not place too much importance on that output.

njuffa · June 22, 2014, 6:55pm

The two major items to check in terms of maximizing system memory throughput would probably be

(1) Ensure that all four memory channels are populated.
(2) Install the fastest speed grade of DDR3 supported by the CPU in the system.

There are other aspects one can optimize, e.g. selecting the lowest latency DDR3 for a particular speed grade. These latencies are usually listed as a string of numbers such as “7-8-8-24”. Lower numbers are better. My past experience with optimizing these latencies is that they make only very minor differences in application-level performance, in the low single-digit percentages.
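The timing numbers are counts of memory clock cycles, conventionally in the order CL, tRCD, tRP, tRAS, so they only compare directly within one speed grade. The sketch below converts them to nanoseconds. The function name and the pairing of timing strings with speed grades are assumptions for illustration, not data from this thread.

```python
def timings_to_ns(timings, mega_transfers_per_s):
    """Convert a timing string like '7-8-8-24' from clock cycles to nanoseconds."""
    names = ("CL", "tRCD", "tRP", "tRAS")
    cycles = [int(x) for x in timings.split("-")]
    if len(cycles) != len(names):
        raise ValueError("expected four numbers, e.g. '7-8-8-24'")
    # DDR transfers twice per clock, so one cycle lasts 2000 / (MT/s) ns.
    cycle_ns = 2000.0 / mega_transfers_per_s
    return {name: n * cycle_ns for name, n in zip(names, cycles)}


# Hypothetical DIMMs: the same kind of string at two different speed grades.
for label, timings, rate in (("DDR3-1333", "7-8-8-24", 1333),
                             ("DDR3-1600", "9-9-9-24", 1600)):
    ns = timings_to_ns(timings, rate)
    print(label, timings, {k: round(v, 2) for k, v in ns.items()})
```

A higher cycle count at a faster speed grade can work out to nearly the same latency in nanoseconds, which is one way to see why tuning these values tends to yield only small gains.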

From the description of the board it seems to provide a copious amount of overclocking and tweaking switches in the BIOS. You might want to ask on an enthusiast forum about advice on those.
