BIOS or Hardware settings to improve pageable memory bandwidth

Here is a screenshot of the current bandwidth situation (click on the image to make it bigger):

[url]http://imgur.com/7VsXwCI[/url]

So there is a big difference between the ‘pinned’ and pageable memory bandwidth.

Is there any BIOS setting or other environment variable which can skew the ratio more in favor of pageable memory?

ASUS X79 Deluxe motherboard, Windows 7 Ultimate

I am assuming you are performing these experiments on a machine that is otherwise idle, i.e. one can be reasonably sure that neither the test application task nor the system memory pages it is touching are swapped out at any time during the test.

The numbers for transfers from “pageable memory” look a bit low. The performance of those transfers is heavily dependent on the performance of the system memory, as they involve a copy from pageable system memory into a pinned system memory buffer that is set up by the driver, followed by a DMA transfer from the pinned system memory buffer to the GPU (or vice versa for transfers from device to host).

Compared to transfers from “pinned memory”, this introduces two issues: (1) system memory bandwidth requirements are doubled, and (2) the pinned buffer set up by the driver has a limited size, potentially requiring multiple transfers. In general, I have found that the second issue has only minor impact, and that the overall transfer rate from/to pageable memory is, to first order, determined by system memory throughput.
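As an illustration, here is a minimal sketch of how the two cases can be compared directly with the CUDA runtime API, using plain malloc() for the pageable buffer and cudaMallocHost() for the pinned one; the 16 MB block size and repetition count are arbitrary choices, and error checking is omitted for brevity:

[code]
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Measure host->device copy throughput for a given host buffer, in GB/sec.
static float h2dGBs(void *host, void *dev, size_t bytes, int reps)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; i++) {
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (float)((double)bytes * reps / (ms * 1.0e-3) / 1.0e9);
}

int main(void)
{
    const size_t bytes = 16 * 1024 * 1024;   // 16 MB block
    const int reps = 20;
    void *dev, *pageable, *pinned;

    cudaMalloc(&dev, bytes);
    pageable = malloc(bytes);                // pageable: staged through the driver's pinned buffer
    cudaMallocHost(&pinned, bytes);          // pinned: DMA'ed directly
    memset(pageable, 1, bytes);              // touch pages so no faults occur during timing
    memset(pinned, 1, bytes);

    printf("pageable h->d: %.2f GB/sec\n", h2dGBs(pageable, dev, bytes, reps));
    printf("pinned   h->d: %.2f GB/sec\n", h2dGBs(pinned, dev, bytes, reps));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}
[/code]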

I would suggest starting out by measuring the system memory throughput, for example with STREAM, to see whether the results are in the expected range. The expected memory throughput will vary with the specifics of your DDR3 configuration, such as the speed grade of the DIMMs and whether the DIMMs are unbuffered, registered, or dual-ranked; but as an order of magnitude you should see total bandwidth of about 25 GB/sec for two-channel configurations (e.g. workstations) and 50 GB/sec for four-channel configurations (typically used for servers) on modern hardware (IvyBridge, Haswell).
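If downloading STREAM is inconvenient, a rough triad-style loop gives a ballpark number; this is only a sketch (the array size and the use of OpenMP are my assumptions), and the real STREAM benchmark remains the better reference:

[code]
#include <chrono>
#include <cstdio>
#include <vector>

int main(void)
{
    const size_t n = 1 << 25;                       // 32M doubles per array (~256 MB each)
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 3.0);
    const double scalar = 3.0;

    auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for
    for (long long i = 0; i < (long long)n; i++) {
        a[i] = b[i] + scalar * c[i];                // STREAM triad: 2 reads + 1 write
    }
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    double gb  = 3.0 * n * sizeof(double) / 1.0e9;  // STREAM counts 24 bytes per element
    printf("triad: %.1f GB/sec\n", gb / sec);
    return 0;
}
[/code]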

Special care needs to be applied when such measurements are performed on platforms with multiple CPU sockets, in which case memory access becomes non-uniform as each CPU has its own memory controller. One would want to control CPU affinity and memory affinity such that the GPU communicates with the “near” CPU and the “near” memory. Under Linux one typically uses numactl to do this; I do not know how it works under Windows.
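As a Linux-only illustration, the hypothetical helper below (gpuNumaNode(), assuming the usual sysfs layout) looks up which NUMA node a GPU is attached to, which can then be fed to numactl:

[code]
#include <cctype>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: return the NUMA node the given CUDA device is attached
// to, by reading /sys/bus/pci/devices/<busid>/numa_node (Linux only).
static int gpuNumaNode(int device)
{
    char busId[32];
    if (cudaDeviceGetPCIBusId(busId, sizeof(busId), device) != cudaSuccess)
        return -1;
    for (char *p = busId; *p; p++)                 // sysfs paths use lowercase hex
        *p = (char)tolower((unsigned char)*p);

    char path[128];
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", busId);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    int node = -1;
    if (fscanf(f, "%d", &node) != 1) node = -1;
    fclose(f);
    return node;                                   // -1 means no NUMA information
}

int main(void)
{
    printf("GPU 0 is attached to NUMA node %d\n", gpuNumaNode(0));
    // then run the test e.g. as:  numactl --cpunodebind=N --membind=N ./bandwidth_test
    return 0;
}
[/code]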

[Later:]

I am not sure what your experience with CUDA-Z is, but I just downloaded it from SourceForge and see significant discrepancies between measurements from my own test app vs. the ones from CUDA-Z. I have a Windows 7 system here with a Tesla C2050, so PCIe2. The CPU is an Intel i7-3820 @ 3.60 GHz (that is IvyBridge, I think), with two DDR3 channels. My measurements (using a block size of 16 MB) show for transfers from / to pageable memory:

host->device 3496 MB/sec
device->host 4371 MB/sec

Measured throughput seems to fluctuate by about 5%. The CUDA-Z 0.8.207 utility shows for transfers from / to pageable memory:

host->device 2295 MB/sec
device->host 3929 MB/sec

In addition, measured throughput fluctuates wildly, and I see values as small as 1900 / 2500 MB/sec. The lower values and massive fluctuation could be an indication that CUDA-Z uses a smaller block size. Without additional analysis I would tend to trust results from my own app more :-)
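If one wanted to check the block-size hypothesis, a simple sweep over transfer sizes from a pageable buffer would show how quickly throughput drops off for small blocks; the size range and repetition count below are arbitrary, and error checking is again omitted:

[code]
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main(void)
{
    const size_t maxBytes = 64 * 1024 * 1024;
    const int reps = 20;
    void *dev, *host;

    cudaMalloc(&dev, maxBytes);
    host = malloc(maxBytes);                      // pageable on purpose
    memset(host, 1, maxBytes);

    for (size_t bytes = 64 * 1024; bytes <= maxBytes; bytes *= 2) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        for (int i = 0; i < reps; i++) {
            cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        }
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        printf("%8d KB : %8.1f MB/sec\n", (int)(bytes / 1024),
               (double)bytes * reps / (ms * 1.0e-3) / 1.0e6);
    }
    free(host);
    cudaFree(dev);
    return 0;
}
[/code]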

If this is your motherboard [url]http://www.asus.com/Motherboards/P9X79_DELUXE/[/url] it appears to support quad-channel system memory, which is unusual for a single-socket system. This online review of the board shows measured system memory bandwidth close to 50 GB/sec:

[url]http://www.hardocp.com/article/2013/09/18/asus_x79_deluxe_lga_2011_motherboard_review[/url]

Yes, that appears to be the same MOBO. I just need some trial and error to optimize the system, since I built it myself.

I just use CUDA-Z to get a general idea of the operations of the device(s). Those numbers do differ from similar tests in the CUDA SDK and from my own tests.

For example, for the K20c CUDA-Z shows the DP FMA at 1166 Gflops (99% of theoretical peak), while the SP FMA via CUDA-Z shows up at 2057 Gflops (~59% of theoretical peak). But in my own tests the SP numbers are much better than shown in CUDA-Z, so I do not place too much importance on that output.
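For reference, the SP FMA rate can be estimated with a microbenchmark along the following lines; this is only a sketch, not my actual test code, and the chain length, unrolling factor, and launch configuration are guesses that may need tuning per GPU:

[code]
#include <cstdio>
#include <cuda_runtime.h>

#define FMA_ITERS 4096

__global__ void fmaKernel(float *out, float a, float b)
{
    // four independent accumulator chains per thread to expose some ILP
    float x0 = a, x1 = a + 1.0f, x2 = a + 2.0f, x3 = a + 3.0f;
    for (int i = 0; i < FMA_ITERS; i++) {
        x0 = fmaf(x0, b, a);
        x1 = fmaf(x1, b, a);
        x2 = fmaf(x2, b, a);
        x3 = fmaf(x3, b, a);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x0 + x1 + x2 + x3;
}

int main(void)
{
    const int blocks = 4096, threads = 256;
    float *out;
    cudaMalloc(&out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    fmaKernel<<<blocks, threads>>>(out, 1.0001f, 1.0002f);   // warm-up launch
    cudaEventRecord(start, 0);
    fmaKernel<<<blocks, threads>>>(out, 1.0001f, 1.0002f);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // each FMA counts as 2 floating-point operations
    double flops = 2.0 * 4.0 * FMA_ITERS * (double)blocks * threads;
    printf("SP FMA: %.0f Gflops\n", flops / (ms * 1.0e-3) / 1.0e9);
    cudaFree(out);
    return 0;
}
[/code]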

The two major items to check in terms of maximizing system memory throughput would probably be

(1) Ensure that all four memory channels are populated.
(2) Install the fastest speed grade of DDR-3 supported by the CPU in the system.

There are other aspects one can optimize, e.g. selecting the lowest latency DDR-3 for a particular speed grade. These latencies are usually listed as a string of numbers such as “7-8-8-24”. Lower numbers are better. My past experience with optimizing these latencies is that they make for very minor application level performance differences, low single-digit percentages.

From the description of the board it seems to provide a copious amount of overclocking and tweaking switches in the BIOS. You might want to ask on an enthusiast forum about advice on those.