I’m a beginner in GPU computing and CUDA.
Recently I’ve been reading about memory allocation in CUDA using pinned memory and zero-copy memory, and I want to use them to develop a mesh-free (particle) fluid solver. Let’s say the number of particles in the simulation starts at about 20K and could reach 300K-500K.
For hardware, I’m using an NVIDIA GTX 590 (2 devices in 1 card) with 6 GB of CPU memory.
I want to get the most out of these 2 devices for the simulation (multi-GPU), and I think zero-copy with cudaHostAlloc has advantages.
So my questions are:
Is there any limitation on using zero-copy cudaHostAlloc to allocate 300K-500K particles of double type? Or what is the predicted limit for double precision?
I’m a bit worried about using pinned memory, because pinned allocations can’t be paged out, so I can’t rely on the virtual memory that would otherwise be available.
If the memory for the particles is very large, does malloc+cudaMalloc have an advantage over zero-copy with cudaHostAlloc?
It depends on a couple of factors:
How often are you going to exchange data between host and device?
Is the data read only once in your kernels?
For 1 transfer in and 1 transfer out, malloc+cudaMalloc+cudaMemcpy is faster than cudaHostAlloc+cudaMalloc+cudaMemcpy for a dataset of size less than 768.
If your data is read only once in your kernels, zero-copy can be 16 to 28% faster than malloc+cudaMalloc+cudaMemcpy.
This is according to my measurements on P45 and X48 chipsets.
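For reference, the two patterns being compared look roughly like this (a minimal sketch with error checking omitted; `scale` is a trivial placeholder kernel, not part of any solver):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Placeholder kernel: reads each element once, which is the case where
// zero-copy pays off.
__global__ void scale(double *p, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 0.5;
}

// Pattern A: pageable host buffer + device buffer + explicit copies.
void staged(size_t n)
{
    size_t bytes = n * sizeof(double);
    double *h = (double *)malloc(bytes);
    double *d;
    cudaMalloc((void **)&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    scale<<<(unsigned)((n + 255) / 256), 256>>>(d, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    free(h);
}

// Pattern B: zero-copy -- mapped, page-locked host memory that the kernel
// reads directly over PCIe, with no explicit cudaMemcpy.
void zero_copy(size_t n)
{
    size_t bytes = n * sizeof(double);
    double *h, *d;
    cudaSetDeviceFlags(cudaDeviceMapHost); // must precede context creation
    cudaHostAlloc((void **)&h, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d, h, 0);
    scale<<<(unsigned)((n + 255) / 256), 256>>>(d, n);
    cudaDeviceSynchronize();
    cudaFreeHost(h);
}
```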
So it is a time-evolution simulation (unsteady, in computational fluid dynamics terms).
It involves an n-body simulation plus a simulation of diffusion at each time step.
The number of mesh elements (particles, in my simulation) can increase as time evolves,
so I could generate zero data in the beginning, or put in some initial values, and the data will evolve later.
Since the amount of data will increase, I will need to resize the memory as needed. The size of the data itself is not fixed as in a grid/mesh-based simulation, because this is a particle simulation.
So is it better to keep the mesh/particles in the GPU the whole time, or should I communicate back and forth between CPU and GPU?
It also needs to write data out every time step, or every several time steps. What do you think?
Using GPU memory would most probably be faster than page-locked memory, since it is an iterative algorithm, as long as you don’t need intermediate results on the host.
Also, if you don’t need intermediate results on the host, I don’t think you need page-locked memory either.
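A sketch of what keeping the data resident on the GPU looks like for this kind of time-stepping loop (the kernel names, `write_snapshot`, and the output cadence are placeholders, not a real API):

```cuda
// Particle state stays resident in device memory (d_state) for the whole
// run; the host copy (h_state) is refreshed only when a snapshot is written.
for (int step = 0; step < n_steps; ++step) {
    nbody_forces<<<blocks, threads>>>(d_state, n);              // placeholder kernel
    diffuse_and_integrate<<<blocks, threads>>>(d_state, n, dt); // placeholder kernel

    if (step % output_every == 0) {
        // An ordinary pageable h_state is fine here; page-locked memory
        // only pays off if you overlap this copy with compute via
        // cudaMemcpyAsync on a separate stream.
        cudaMemcpy(h_state, d_state, bytes, cudaMemcpyDeviceToHost);
        write_snapshot(h_state, n, step);                       // your own I/O
    }
}
```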