cudaHostAlloc performance is slow

I am trying to use cudaHostAlloc to speed up host-to-device memory transfers.
I allocated the host buffers with cudaHostAlloc (… cudaHostAllocDefault).
I expected data transfers to be about 2x faster.

The task is to allocate 2 host input buffers of 2 MB each
and 1 host buffer for output,
load data, copy it to the device, do some pixel processing,
then copy one result buffer back to the host.
The pixel processing is very simple, so the task should be bound by memory transfers.

But I measured only a small performance improvement compared with host buffers allocated by plain malloc.
The speedup is only about 20%.

Is something wrong, or are copies from pageable memory nearly as fast on my system?

My system has
an Intel Core i7-2600K,
a GeForce GTX 480 graphics card (Fermi),
Windows 7 64-bit.

It is not clear to me what your concern is. The performance of the cudaHostAlloc() call itself (cudaHostAllocDefault is just the flag passed to it) should be bounded by the speed of the underlying OS operations. The performance of host<->device transfers will depend on the performance of the PCIe link and, when pageable memory is used, on the performance of the system memory. It would be best to measure these transfer rates on your platform in isolation to make sure they are in the expected range. I previously posted some data for an M2090 on a slightly older Nehalem system here:

As you can see, the throughput for host->device copies with pinned buffers is only 20% better than for pageable buffers, presumably due to the speedy system memory on this platform, which minimizes the overhead of the host->host copy necessary for transfers from pageable memory (i.e. a copy from the user’s buffer into a pinned buffer pre-allocated by the driver, from where it can be DMAed to the device).
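
To measure the transfer rates in isolation on your own machine, something along these lines should work. It is only a minimal sketch, not a tuned benchmark: the 2 MB size is taken from the question, while the repetition count, the CHECK macro, and the helper names gbPerSec and timeH2D are my own assumptions for illustration. It times repeated host->device copies with CUDA events, once from a malloc'ed (pageable) buffer and once from a cudaHostAlloc'ed (pinned) buffer.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Convert a byte count and an elapsed time in milliseconds to GB/s (1 GB = 1e9 bytes).
static double gbPerSec(size_t bytes, float ms) {
    return (static_cast<double>(bytes) / 1.0e9) / (ms / 1.0e3);
}

// Average host->device throughput over `reps` blocking copies, timed with CUDA events.
static double timeH2D(void *dst, const void *src, size_t bytes, int reps) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    // One untimed copy so that one-time setup costs are not measured.
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return gbPerSec(bytes * reps, ms);
}

int main() {
    const size_t bytes = 2u << 20;  // 2 MB, the buffer size from the question
    const int reps = 100;           // assumed repetition count, adjust as needed

    void *dev = nullptr;
    void *pinned = nullptr;
    void *pageable = std::malloc(bytes);
    if (pageable == nullptr) return EXIT_FAILURE;

    CHECK(cudaMalloc(&dev, bytes));
    CHECK(cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault));

    // Touch both buffers so the pages are actually backed by physical memory.
    std::memset(pageable, 1, bytes);
    std::memset(pinned, 1, bytes);

    std::printf("pageable H2D: %.2f GB/s\n", timeH2D(dev, pageable, bytes, reps));
    std::printf("pinned   H2D: %.2f GB/s\n", timeH2D(dev, pinned, bytes, reps));

    CHECK(cudaFreeHost(pinned));
    CHECK(cudaFree(dev));
    std::free(pageable);
    return 0;
}
```

Build it with nvcc and compare the two numbers against what you would expect from your PCIe link. Increasing the buffer size should show whether the 2 MB transfers are large enough to approach the peak rate. The bandwidthTest sample that ships with the CUDA toolkit measures the same thing if you would rather not write your own.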