zero copy using cudaHostAlloc vs normal malloc+cudaMalloc

xnov · April 27, 2012, 7:47am

I’m a beginner in GPU computing and CUDA.
Recently I’ve been reading about memory allocation in CUDA using pinned memory and zero-copy memory,
and I want to use it to develop a mesh-free (particle) fluid solver. Let’s say the number of particles in the simulation starts at 20K and could reach 300K-500K.
For the hardware, I’m using an Nvidia GTX 590 (2 devices on 1 card) with 6 GB of CPU memory.
I want to make full use of these 2 devices for the simulation (multi-GPU), and I think zero-copy cudaHostAlloc has advantages.

So my questions are:

Is there any limitation on using zero-copy cudaHostAlloc to allocate 300K-500K particles in double precision? Or what is the expected limit for double precision?
I’m a bit worried about using pinned memory, because pinned memory cannot be swapped out and so I can’t rely on the available virtual memory.
If the memory for the particles is very large, does malloc+cudaMalloc have an advantage over zero-copy using cudaHostAlloc?

alex

insmvb00 · April 27, 2012, 11:57am

Hi!

Pinned memory is physically limited by the system DRAM, but if you pin a large part of it the overall performance of the system will suffer. The same applies to zero-copy memory.

In your case, if the data fits in the GPU RAM, I think cudaMalloc is the best option.

Moreover, I think the mesh will stay in GPU memory the whole time, so the copy will occur only once, at the beginning of the problem, won’t it?

Regards!

apostglen46 · April 27, 2012, 6:52pm

It depends on a couple of factors:
How often are you going to exchange data between host and device?
Is the data read only once in your kernels?

For one transfer in and one transfer out, malloc+cudaMalloc+memcpy is faster than cudaHostAlloc+cudaMalloc+memcpy for a dataset of size less than 768.
If your data is read only once in your kernels, zero copy could be 16% to 28% faster than malloc+cudaMalloc+memcpy.
This is according to my measurements with the P45 and X48 chipsets.
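
For reference, the two paths being compared look roughly like this. It is only a sketch, not the benchmark behind the numbers above: the kernel `scale`, the helper `run_both_paths` and the `CHECK` macro are hypothetical names made up for illustration, and no timing is included.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Each input element is read exactly once, the case where zero copy can pay off.
__global__ void scale(const double* in, double* out, int n, double a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

// Runs the same kernel through both paths; returns true if the results agree.
bool run_both_paths(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(double);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Path A: pageable host memory + device memory + explicit copies.
    double* h = static_cast<double*>(std::malloc(bytes));
    if (!h) return false;
    for (int i = 0; i < n; ++i) h[i] = i;
    double *d_in = nullptr, *d_out = nullptr;
    CHECK(cudaMalloc((void**)&d_in, bytes));
    CHECK(cudaMalloc((void**)&d_out, bytes));
    CHECK(cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice));
    scale<<<blocks, threads>>>(d_in, d_out, n, 2.0);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost));

    // Path B: zero copy. The kernel accesses pinned host memory directly.
    double *z_in = nullptr, *z_out = nullptr, *zd_in = nullptr, *zd_out = nullptr;
    CHECK(cudaHostAlloc((void**)&z_in, bytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void**)&z_out, bytes, cudaHostAllocMapped));
    for (int i = 0; i < n; ++i) z_in[i] = i;
    CHECK(cudaHostGetDevicePointer((void**)&zd_in, z_in, 0));
    CHECK(cudaHostGetDevicePointer((void**)&zd_out, z_out, 0));
    scale<<<blocks, threads>>>(zd_in, zd_out, n, 2.0);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());  // z_out is only valid on the host after this

    bool same = true;
    for (int i = 0; i < n; ++i) {
        if (h[i] != z_out[i]) same = false;
    }

    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(z_in));
    CHECK(cudaFreeHost(z_out));
    std::free(h);
    return same;
}

int main() {
    // Ask for mapped pinned memory support before any other CUDA call.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));
    std::printf("results match: %s\n", run_both_paths(500000) ? "yes" : "no");
    return 0;
}
```

In path B every kernel access goes over the PCIe bus, so if the kernel read the same element several times I would expect the explicit copy to win.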

xnov · April 28, 2012, 12:43pm

So it is a time-evolution simulation (unsteady, in computational fluid dynamics terms).
It involves an n-body simulation plus a diffusion simulation at each time step.
The number of mesh elements (particles, in my simulation) can increase as time evolves.

So I could fill the data with zeros or some initial values at the beginning, and the data will evolve later.
Since the amount of data will increase, I will need to resize the memory as needed. The data size is not fixed as in a grid/mesh-based simulation, because it is a particle simulation.
So is it better to keep the mesh/particles on the GPU all the time, or should I communicate back and forth between CPU and GPU?
It also needs to write data out every time step or every several time steps. What do you think?

apostglen46 · April 28, 2012, 1:57pm

Using GPU memory would most probably be faster than page-locked memory, since it is an iterative algorithm, provided you don’t need intermediate results on the host.
Also, if you don’t need intermediate results on the host, I don’t think you need page-locked memory either.
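
A rough sketch of what keeping the data on the GPU could look like, with a copy back only every few steps for output. Everything named here is hypothetical: the kernel `advance` is a stand-in for the real n-body and diffusion steps, and `simulate`, `capacity` and `output_interval` are made up for illustration. Allocating once for the largest expected particle count, instead of resizing, is an assumption on my part that seems reasonable for 500K particles.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Placeholder time step: x += dt * v for the n active particles.
__global__ void advance(double* x, const double* v, int n, double dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += dt * v[i];
}

// Runs `steps` steps on device-resident data and returns x[0] at the end.
double simulate(int capacity, int n, int steps, int output_interval, double dt) {
    const size_t cap_bytes = static_cast<size_t>(capacity) * sizeof(double);
    double* h = static_cast<double*>(std::malloc(cap_bytes));
    if (!h) std::exit(EXIT_FAILURE);
    for (int i = 0; i < capacity; ++i) h[i] = 1.0;  // initial velocities

    // Allocate for the upper bound once; n may grow up to capacity later.
    double *d_x = nullptr, *d_v = nullptr;
    CHECK(cudaMalloc((void**)&d_x, cap_bytes));
    CHECK(cudaMalloc((void**)&d_v, cap_bytes));
    CHECK(cudaMemset(d_x, 0, cap_bytes));  // all-zero bytes represent 0.0
    CHECK(cudaMemcpy(d_v, h, cap_bytes, cudaMemcpyHostToDevice));  // one-time upload

    const int threads = 256;
    for (int s = 1; s <= steps; ++s) {
        // New particles would be added here by increasing n (n <= capacity).
        const int blocks = (n + threads - 1) / threads;
        advance<<<blocks, threads>>>(d_x, d_v, n, dt);
        CHECK(cudaGetLastError());

        if (s % output_interval == 0) {
            // Copy back only the active particles, only when output is due.
            CHECK(cudaMemcpy(h, d_x, static_cast<size_t>(n) * sizeof(double),
                             cudaMemcpyDeviceToHost));
            std::printf("step %d: x[0] = %g\n", s, h[0]);
        }
    }

    CHECK(cudaMemcpy(h, d_x, sizeof(double), cudaMemcpyDeviceToHost));
    const double result = h[0];
    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_v));
    std::free(h);
    return result;
}

int main() {
    simulate(500000, 20000, 100, 10, 1e-3);
    return 0;
}
```

If the periodic copy back turns out to be a bottleneck, the host buffer could be allocated with cudaHostAlloc instead of malloc, but I would measure first.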

xnov · May 2, 2012, 6:23pm

What about zero-copy memory with UVA for multiple devices? Is it faster than host-to-device memory copies with UVA?
