I am assuming you are performing these experiments on a machine that is otherwise idle, i.e. one can be reasonably sure that neither the test application task nor the system memory pages it is touching are swapped out at any time during the test.
The numbers for transfers from “pageable memory” look a bit low. The performance of those transfers is heavily dependent on the performance of the system memory, as they involve a copy from pageable system memory into a pinned system memory buffer set up by the driver, followed by a DMA transfer from the pinned buffer to the GPU (or vice versa for transfers from device to host).
Compared to transfers from “pinned memory”, this introduces two issues: (1) system memory bandwidth requirements are doubled, and (2) the pinned buffer set up by the driver has a limited size, potentially requiring multiple transfers. In general, I have found that the second issue has only minor impact, and that the overall transfer rate to/from pageable memory is, to first order, determined by system memory throughput.
I would suggest starting out by measuring the system memory throughput, for example with STREAM, to see whether the results are in the expected range. The expected memory throughput will vary with the specifics of your DDR3 configuration, such as the speed grade of the DIMMs and whether the DIMMs are unbuffered, registered, or dual-ranked; but as an order of magnitude you should see a total bandwidth of about 25 GB/sec for two-channel configurations (e.g. workstations) and 50 GB/sec for four-channel configurations (typically used in servers) on modern hardware (IvyBridge, Haswell).
Special care needs to be taken when such measurements are performed on platforms with multiple CPU sockets, in which case memory access becomes non-uniform as each CPU has its own memory controller. One would want to control CPU affinity and memory affinity such that the GPU communicates with the “near” CPU and the “near” memory. Under Linux one typically uses numactl to do this; I do not know how this is done under Windows.
I am not sure what your experience with CUDA-Z is, but I just downloaded it from SourceForge and see significant discrepancies between measurements from my own test app and those from CUDA-Z. I have a Windows 7 system here with a Tesla C2050, so PCIe gen2. The CPU is an Intel i7-3820 @ 3.60 GHz (that is Sandy Bridge-E, I believe), with two DDR3 channels. My measurements (using a block size of 16 MB) show for transfers from / to pageable memory:
host->device 3496 MB/sec
device->host 4371 MB/sec
Measured throughput seems to fluctuate by about 5%. The CUDA-Z 0.8.207 utility shows for transfers from / to pageable memory:
host->device 2295 MB/sec
device->host 3929 MB/sec
In addition, measured throughput fluctuates wildly, and I see values as low as 1900 / 2500 MB/sec. The lower values and massive fluctuation could be an indication that CUDA-Z uses a smaller block size. Without additional analysis I would tend to trust the results from my own app more :-)