Zero Copy VS Page-Locked

Hi everybody!
I'm an Italian student, new to CUDA and graphics boards, and I'm doing some studies with the GPU, in particular on the latency of different types of memory and ways of storing data.
I randomly generate a set of N data (numbers), then I launch a simple kernel that does a dot product.
Using cudaEventRecord I measure the latency of getting the data to the kernel with pageable memory (malloc'd host memory copied into a cudaMalloc buffer), page-locked or pinned memory (cudaMallocHost), and zero-copy memory (cudaHostGetDevicePointer).
I was expecting zero-copy to be the fastest, but for N larger than roughly 50000 the page-locked, non-mapped memory is faster.
Can page-locked memory be faster than zero-copy? Why? Or have I made some mistake?
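To make the setup concrete, here is a simplified sketch of the kind of code I mean for the two pinned variants (illustrative only, not my exact benchmark: the kernel, the size, and what exactly falls inside the timed region are just placeholders):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void dotProduct(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, a[i] * b[i]);   // naive reduction, just for timing
}

int main()
{
    const int n = 51200;                       // one of the sizes from the table below
    float ms;
    cudaEvent_t start, stop;

    cudaSetDeviceFlags(cudaDeviceMapHost);     // must precede context creation
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float *d_out;
    cudaMalloc((void **)&d_out, sizeof(float));

    // Page-locked (pinned), not mapped: explicit DMA copy, then the kernel.
    float *h_pinned, *d_in;
    cudaMallocHost((void **)&h_pinned, n * sizeof(float));
    cudaMalloc((void **)&d_in, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_pinned[i] = rand() / (float)RAND_MAX;

    cudaMemset(d_out, 0, sizeof(float));
    cudaEventRecord(start);
    cudaMemcpy(d_in, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice);
    dotProduct<<<(n + 255) / 256, 256>>>(d_in, d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned, not mapped (copy + kernel): %.3f ms\n", ms);

    // Zero-copy: pinned AND mapped; the kernel reads host memory over PCIe.
    float *h_mapped, *d_mapped;
    cudaHostAlloc((void **)&h_mapped, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_mapped, h_mapped, 0);
    for (int i = 0; i < n; ++i) h_mapped[i] = rand() / (float)RAND_MAX;

    cudaMemset(d_out, 0, sizeof(float));
    cudaEventRecord(start);
    dotProduct<<<(n + 255) / 256, 256>>>(d_mapped, d_mapped, d_out, n);  // no cudaMemcpy
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("zero-copy (kernel only):            %.3f ms\n", ms);
    return 0;
}
```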

Example: some timings (in milliseconds) for the three types of memory and different numbers of (randomly generated) input data.
The mean and the error are taken over 300 runs for each data size.
N data  | Pageable       | Page-locked    | Zero-copy
        | mean    err    | mean    err    | mean    err
512     | 0.23    0.02   | 0.16    0.01   | 0.06    0.01
1024    | 0.23    0.02   | 0.16    0.01   | 0.07    0.01
2048    | 0.26    0.02   | 0.17    0.01   | 0.07    0.01
4096    | 0.32    0.02   | 0.19    0.01   | 0.09    0.01
10240   | 0.49    0.02   | 0.25    0.01   | 0.14    0.01
51200   | 1.58    0.06   | 0.64    0.03   | 0.65    0.03
76800   | 2.28    0.05   | 0.88    0.04   | 0.96    0.04
92160   | 2.70    0.06   | 1.03    0.04   | 1.15    0.05
102400  | 2.97    0.09   | 1.11    0.06   | 1.27    0.08
128000  | 3.60    0.17   | 1.34    0.07   | 1.58    0.10
307200  | 6.74    0.23   | 3.06    0.20   | 3.77    0.26
512000  | 10.13   0.48   | 5.01    0.34   | 6.23    0.42

DMA from pinned memory has two slight advantages: (i) the kernel doesn't need to hide the huge latency of the PCIe bus, and (ii) the memory accesses are strictly sequential, allowing maximum bandwidth from the SDRAM. At small sizes, however, these advantages are outweighed by the extra copy step.

Thank you very much.

But I'm still confused: I thought zero-copy was a type of pinned memory. Shouldn't it have the same two properties of pinned memory that you described?

Yes, memory needs to be pinned to be mapped into the GPU address space (zero-copy). However, zero-copy memory cannot be accessed in a strictly linear pattern, because the accesses come from multiple blocks executing in parallel with unpredictable timing. Latency also matters more when reading zero-copy memory, because the memory transactions are only initiated when the kernel actually needs the data, whereas a DMA transfer is started before the kernel executes.
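In API terms the two are close relatives. A small sketch of what I mean (the helper name is just for illustration, and the device needs host-memory mapping enabled, e.g. via cudaSetDeviceFlags(cudaDeviceMapHost) on a device reporting canMapHostMemory):

```cpp
#include <cuda_runtime.h>

// Sketch: both buffers below are pinned (page-locked); only the second one is
// also mapped into the GPU address space, which is what makes it "zero-copy".
void allocateBothKinds(size_t bytes)   // hypothetical helper, name is illustrative
{
    float *h_pinned;                              // pinned, NOT mapped: data must be
    cudaMallocHost((void **)&h_pinned, bytes);    // cudaMemcpy'd, and that DMA
                                                  // completes before the kernel starts

    float *h_mapped, *d_alias;
    cudaHostAlloc((void **)&h_mapped, bytes, cudaHostAllocMapped);  // pinned AND mapped
    cudaHostGetDevicePointer((void **)&d_alias, h_mapped, 0);       // device-side alias

    // A kernel dereferencing d_alias issues its PCIe transactions only at the
    // moment each warp touches the data, so the bus latency lands inside the
    // kernel's execution time instead of being hidden by an up-front copy.
    (void)d_alias;

    cudaFreeHost(h_pinned);
    cudaFreeHost(h_mapped);
}
```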

Perfect! Thank you very much.
Do you know if there are guides that fully describe the different types of memory, and the hardware and software architecture?
The NVIDIA CUDA Programming Guide is too generic.

This paper reveals quite a bit of undocumented detail through reverse engineering: Demystifying GPU Microarchitecture through Microbenchmarking.

Apart from that, my knowledge of CUDA comes from the Programming Guide and this forum (and my own experience with CUDA, of course). I've also been into chip design previously, so CUDA concepts usually come with a mental picture of how I might have implemented them myself.