Zero Copy VS Page-Locked

Stefano_Gelain · September 14, 2011, 3:06am

Hi everybody!
I'm an Italian student, new to CUDA and graphics boards, and I'm doing some studies on GPUs, in particular on latency with different types of memory and data storage.
I randomly generate a set of N numbers, then I launch a simple kernel that computes a dot product.
Using cudaEventRecord, I'm measuring the latency of storing the data with pageable memory (plain host allocation copied to a cudaMalloc buffer), page-locked or pinned memory (cudaMallocHost), and zero-copy memory (cudaHostGetDevicePointer).
I was expecting zero copy to be the fastest, but for N >~ 50000 the page-locked, non-mapped memory is faster.
Can page-locked memory be faster than zero copy? Why? Or have I made a mistake?

Example: times (in milliseconds) for the three types of memory, for different numbers of randomly generated input data.
The mean and the error are taken over 300 runs for each number of data.
| N data | Pageable mean | Pageable err | Page-locked mean | Page-locked err | Zero copy mean | Zero copy err |
|---|---|---|---|---|---|---|
| 512 | 0.23 | 0.02 | 0.16 | 0.01 | 0.06 | 0.01 |
| 1024 | 0.23 | 0.02 | 0.16 | 0.01 | 0.07 | 0.01 |
| 2048 | 0.26 | 0.02 | 0.17 | 0.01 | 0.07 | 0.01 |
| 4096 | 0.32 | 0.02 | 0.19 | 0.01 | 0.09 | 0.01 |
| 10240 | 0.49 | 0.02 | 0.25 | 0.01 | 0.14 | 0.01 |
| 51200 | 1.58 | 0.06 | 0.64 | 0.03 | 0.65 | 0.03 |
| 76800 | 2.28 | 0.05 | 0.88 | 0.04 | 0.96 | 0.04 |
| 92160 | 2.70 | 0.06 | 1.03 | 0.04 | 1.15 | 0.05 |
| 102400 | 2.97 | 0.09 | 1.11 | 0.06 | 1.27 | 0.08 |
| 128000 | 3.60 | 0.17 | 1.34 | 0.07 | 1.58 | 0.10 |
| 307200 | 6.74 | 0.23 | 3.06 | 0.20 | 3.77 | 0.26 |
| 512000 | 10.13 | 0.48 | 5.01 | 0.34 | 6.23 | 0.42 |
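
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of an event-based timing loop that reports a mean and a spread over repeated runs. It is not the code used for the table above. The kernel `scale` is only a placeholder workload, the names `time_once` and `time_stats` are made up for illustration, and since the post does not say how "err" was defined, the sample standard deviation is an assumption. Error checking is left out for brevity.

```cuda
#include <cassert>
#include <cmath>
#include <cuda_runtime.h>

// Placeholder workload (hypothetical): doubles every element in place.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Times a single kernel launch in milliseconds using CUDA events.
float time_once(float *d_x, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    scale<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // block the host until `stop` has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Mean and sample standard deviation of the elapsed time over `runs` launches.
// Assumption: "err" is taken to be the sample standard deviation.
void time_stats(int n, int runs, float *mean, float *stddev) {
    assert(n > 0 && runs >= 2);
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    double sum = 0.0, sum_sq = 0.0;
    for (int r = 0; r < runs; ++r) {
        double ms = time_once(d_x, n);
        sum += ms;
        sum_sq += ms * ms;
    }
    cudaFree(d_x);

    double var = (sum_sq - sum * sum / runs) / (runs - 1);
    *mean = (float)(sum / runs);
    *stddev = (float)std::sqrt(var > 0.0 ? var : 0.0);
}
```

Calling `time_stats(51200, 300, &mean, &err)` would give one row of a table like the one above for this placeholder kernel. Whatever should count towards the latency (copies, kernel, copy back) has to sit between the two `cudaEventRecord` calls.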

tera · September 14, 2011, 5:48pm

DMA from pinned memory has two slight advantages: (i) the kernel doesn't need to hide the huge latency of the PCIe bus, and (ii) memory accesses are strictly sequential, allowing maximum bandwidth from the SDRAM. At small sizes, however, these are outweighed by the extra copy step.
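
To make the "extra copy step" concrete, here is a rough sketch of the copy-then-compute path, with the host buffers either pinned or pageable. It assumes float data and a block-wise reduction for the dot product, which may differ from the original test. The names `dot_partial`, `dot_with_copy`, `THREADS` and `BLOCKS` are invented for this example, the inputs are filled with constants rather than random numbers, and error checking is omitted.

```cuda
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

const int THREADS = 256;  // must be a power of two for the reduction below
const int BLOCKS = 64;

// Each block writes one partial sum; the host adds the partials afterwards.
__global__ void dot_partial(const float *a, const float *b, float *partial, int n) {
    __shared__ float cache[THREADS];
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        acc += a[i] * b[i];
    cache[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

// Copies both inputs to device memory, runs the kernel, copies the partials back.
// `elapsed_ms` receives the time for copies plus kernel.
float dot_with_copy(int n, bool pinned, float *elapsed_ms) {
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    float *h_a, *h_b;
    if (pinned) {  // page-locked host memory: the DMA engine reads it directly
        cudaMallocHost((void **)&h_a, bytes);
        cudaMallocHost((void **)&h_b, bytes);
    } else {       // pageable host memory from the ordinary allocator
        h_a = (float *)malloc(bytes);
        h_b = (float *)malloc(bytes);
    }
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_partial;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_partial, BLOCKS * sizeof(float));
    float h_partial[BLOCKS];

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // the extra copy step
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    dot_partial<<<BLOCKS, THREADS>>>(d_a, d_b, d_partial, n);
    cudaMemcpy(h_partial, d_partial, BLOCKS * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(elapsed_ms, start, stop);

    float result = 0.0f;
    for (int i = 0; i < BLOCKS; ++i) result += h_partial[i];

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_partial);
    if (pinned) { cudaFreeHost(h_a); cudaFreeHost(h_b); }
    else { free(h_a); free(h_b); }
    return result;
}
```

In this sketch the kernel only ever reads device memory, so all PCIe traffic happens in the `cudaMemcpy` calls before the launch. The only difference between the two variants is how the host buffers are allocated.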

Stefano_Gelain · September 15, 2011, 9:08pm

Thank you very much.

But I'm still confused: I thought that zero copy was a type of pinned memory. Shouldn't it have the same two properties of pinned memory that you described?

tera · September 16, 2011, 3:48am

Yes, memory needs to be pinned for mapping it into the GPU address space (zero-copy). However, zero-copy memory cannot be accessed in a strictly linear pattern because the accesses come from multiple blocks executing in parallel with unpredictable timing. Latency is also more important for reading zero-copy memory because the memory transactions are only initiated when the kernel actually needs the data, whereas a DMA transfer is started before the kernel executes.
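
As an illustration of the zero-copy case, here is a minimal sketch in which the kernel reads mapped pinned host memory directly, so every load in the loop turns into a PCIe transaction issued while the kernel runs. It assumes float data and a device that supports mapped host memory. The names `dot_partial`, `dot_zero_copy`, `THREADS` and `BLOCKS` are invented for this example, and error checking is omitted.

```cuda
#include <cassert>
#include <cuda_runtime.h>

const int THREADS = 256;  // must be a power of two for the reduction below
const int BLOCKS = 64;

// Each block writes one partial sum; a[] and b[] may point to mapped host memory.
__global__ void dot_partial(const float *a, const float *b, float *partial, int n) {
    __shared__ float cache[THREADS];
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        acc += a[i] * b[i];  // with zero copy, these loads go over the PCIe bus
    cache[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

float dot_zero_copy(int n) {
    assert(n > 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    assert(prop.canMapHostMemory);  // zero copy needs mapped host memory support

    // Request support for mapped pinned allocations; call this before
    // other CUDA work on the device.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    size_t bytes = n * sizeof(float);
    float *h_a, *h_b, *h_partial;
    cudaHostAlloc((void **)&h_a, bytes, cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_b, bytes, cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_partial, BLOCKS * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device-side aliases of the same host buffers; no cudaMemcpy is involved.
    float *d_a, *d_b, *d_partial;
    cudaHostGetDevicePointer((void **)&d_a, h_a, 0);
    cudaHostGetDevicePointer((void **)&d_b, h_b, 0);
    cudaHostGetDevicePointer((void **)&d_partial, h_partial, 0);

    dot_partial<<<BLOCKS, THREADS>>>(d_a, d_b, d_partial, n);
    cudaDeviceSynchronize();  // results are only valid on the host after this

    float result = 0.0f;
    for (int i = 0; i < BLOCKS; ++i) result += h_partial[i];

    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    cudaFreeHost(h_partial);
    return result;
}
```

The grid-stride loop means that which block touches which part of `a` and `b` at a given moment depends on scheduling, which is the non-linear access pattern described above.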

Stefano_Gelain · September 16, 2011, 10:11pm

Perfect! Thank you very much.
Do you know if there are guides that fully describe the different types of memory, and the hardware and software architecture?
The NVIDIA CUDA Programming Guide is too generic.

tera · September 19, 2011, 1:29am

This paper reveals quite a bit of undocumented detail through reverse engineering: Demystifying GPU Microarchitecture through Microbenchmarking.

Apart from that, I’ve got my knowledge of CUDA from the Programming Guide and this forum (and my own experience with CUDA of course). But then I’ve been into chip design previously, so the CUDA concepts usually go along in my head with some mental picture of how I might have implemented them myself.
