zero copy using cudaHostAlloc vs normal malloc+cudaMalloc

I’m a beginner in GPU computing and CUDA.
Recently I’ve been reading about memory allocation in CUDA using pinned memory and zero-copy memory,
and I want to use it to develop a mesh-free (particle) fluid solver. The number of particles in a simulation starts around 20K and could reach 300K–500K.
For hardware, I’m using an Nvidia GTX 590 (2 devices on 1 card) with 6 GB of CPU memory.
I want to make the most of both devices for the simulation (multi-GPU), and I think zero-copy with cudaHostAlloc has advantages.

so my questions here:

  1. Is there any limitation to using zero-copy cudaHostAlloc for allocating 300K–500K particles in double precision? What is the predicted limit for doubles?
    I’m somewhat worried about using pinned memory because it can’t be swapped out, so I can’t use the virtual memory available.

  2. If the memory for the particles is very large, does malloc+cudaMalloc have an advantage over zero-copy with cudaHostAlloc?
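For reference, here is a minimal sketch of the two strategies being compared. The array size, kernel, and launch configuration are placeholders, not taken from the thread:

```cuda
// Sketch: zero-copy (mapped pinned memory) vs. classic malloc+cudaMalloc+cudaMemcpy.
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale(double *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0;            // placeholder for real particle work
}

int main(void) {
    const int n = 500000;              // e.g. 500K particles
    const size_t bytes = n * sizeof(double);

    // --- Option A: zero-copy. The kernel reads host DRAM over PCIe on every access.
    cudaSetDeviceFlags(cudaDeviceMapHost);      // must precede context creation
    double *h_mapped, *d_alias;
    cudaHostAlloc(&h_mapped, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_alias, h_mapped, 0);
    scale<<<(n + 255) / 256, 256>>>(d_alias, n);
    cudaDeviceSynchronize();

    // --- Option B: pageable host memory + device memory + explicit copies.
    double *h_plain = (double *)malloc(bytes);
    double *d_buf;
    cudaMalloc(&d_buf, bytes);
    cudaMemcpy(d_buf, h_plain, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d_buf, n);  // kernel reads fast on-board DRAM
    cudaMemcpy(h_plain, d_buf, bytes, cudaMemcpyDeviceToHost);

    cudaFreeHost(h_mapped);
    cudaFree(d_buf);
    free(h_plain);
    return 0;
}
```

The key difference: with option A every kernel access goes over PCIe, while option B pays for the transfers once and then runs at device-memory bandwidth.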



Pinned memory is limited physically by the system DRAM, but pinning a large fraction of it will hurt overall system performance. The same applies to zero-copy memory.

In your case, if the data fits in the GPU RAM, I think cudaMalloc is the best option. For example, 500K particles with a dozen double-precision attributes each is only about 48 MB, which fits comfortably in the 1.5 GB each GTX 590 GPU has.

Moreover, I would expect the mesh to stay in GPU memory the whole time, so the copy would happen only once at the beginning of the problem, wouldn’t it?


It depends on a couple of factors:
How often are you going to exchange data between host and device?
Is the data read only once in your kernels?

For 1 transfer in and 1 transfer out, malloc+cudaMalloc+cudaMemcpy is faster than cudaHostAlloc+cudaMalloc+cudaMemcpy for a dataset of size less than 768.
If your data is read only once in your kernels, zero-copy could be 16–28% faster than malloc+cudaMalloc+cudaMemcpy.
This is according to my measurements with P45 and X48 chipsets.

So it is a time-evolution simulation (unsteady, in computational fluid dynamics terms).
It involves an n-body simulation plus a diffusion simulation at each time step,
and the number of mesh points (particles, in my case) can increase as time evolves.

So I could initialize the data to zero (or some value) at the beginning, and the data evolves later.
Since the number of particles will increase, I will need to resize the memory as needed. The data size is not fixed as it would be in a grid/mesh-based simulation, because this is a particle simulation.
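Since there is no cudaRealloc, growing a device-side array means allocate-copy-free. A hedged sketch (the growth factor and flat double layout are assumptions, not from the thread):

```cuda
// Grows *d_buf from old_count to at least new_count doubles, copying the
// existing particles device-to-device (no host round trip).
// Returns the new capacity in elements.
#include <cuda_runtime.h>

size_t grow_device_buffer(double **d_buf, size_t old_count, size_t new_count) {
    size_t capacity = old_count ? old_count : 1024;
    while (capacity < new_count) capacity *= 2;   // doubling amortizes reallocations

    double *d_new;
    cudaMalloc(&d_new, capacity * sizeof(double));
    if (*d_buf) {
        cudaMemcpy(d_new, *d_buf, old_count * sizeof(double),
                   cudaMemcpyDeviceToDevice);
        cudaFree(*d_buf);
    }
    *d_buf = d_new;
    return capacity;
}
```

Over-allocating with a growth factor means the expensive reallocation happens only occasionally as particles are added, rather than at every time step.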
So is it better to keep the mesh/particles on the GPU the whole time, or should I communicate back and forth between CPU and GPU?
I also need to write data out every time step (or every several time steps). What do you think?

Since it is an iterative algorithm, using GPU memory would most probably be faster than page-locked memory, provided you don’t need intermediate results on the host.
And if you don’t need intermediate results on the host, I don’t think you need page-locked memory either.
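The resident-on-GPU pattern described above can be sketched like this. The kernel body, step count, and output interval are placeholders:

```cuda
// Keep particle state resident on the GPU across time steps; copy back to the
// host only every `output_interval` steps for file output.
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void step_kernel(double *state, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] += 0.1;        // placeholder for the real update
}

int main(void) {
    const int n = 20000, steps = 1000, output_interval = 50;
    const size_t bytes = n * sizeof(double);

    double *h_state = (double *)malloc(bytes);
    double *d_state;
    cudaMalloc(&d_state, bytes);
    cudaMemset(d_state, 0, bytes);     // start from zero data

    for (int t = 0; t < steps; ++t) {
        step_kernel<<<(n + 255) / 256, 256>>>(d_state, n);
        if ((t + 1) % output_interval == 0) {
            // only now pay for a PCIe transfer
            cudaMemcpy(h_state, d_state, bytes, cudaMemcpyDeviceToHost);
            // ... write h_state to disk ...
        }
    }
    cudaFree(d_state);
    free(h_state);
    return 0;
}
```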

What about zero-copy memory with UVA for multiple devices? Is it faster than host-to-device memory copy with UVA?
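For the multi-GPU case, a portable mapped host buffer can be made visible to both GPUs of the GTX 590. This is only a sketch of the setup, not a performance claim; each device still accesses the buffer over PCIe:

```cuda
// One portable, mapped host buffer shared by both devices of a GTX 590.
// Each device gets its own device-side alias pointer via cudaHostGetDevicePointer.
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 500000 * sizeof(double);   // e.g. 500K doubles
    double *h_shared, *d_ptr[2];

    // Enable mapped memory on both devices before any context is created.
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaSetDeviceFlags(cudaDeviceMapHost);
    }

    // Portable: pinned for all CUDA contexts. Mapped: addressable from kernels.
    cudaSetDevice(0);
    cudaHostAlloc(&h_shared, bytes,
                  cudaHostAllocPortable | cudaHostAllocMapped);

    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaHostGetDevicePointer(&d_ptr[dev], h_shared, 0);
        // launch kernels on device `dev` using d_ptr[dev] ...
    }

    cudaFreeHost(h_shared);
    return 0;
}
```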