Asynchronous copy and memory allocation for time-evolving simulation

I’m working on a numerical method for time-evolution simulations, and I want to try asynchronous memory copies to speed up my simulation.
However, I got confused recently: the asynchronous copy needs the memory to be allocated on the host as well, so in the worst case I would use twice as much memory as the serial code. I consider memory crucial in my simulation because it will deal with 100K or more grid points/particles.
So, any suggestions on what I should do?
Should I use page-locked memory for both host and device? Is there any drawback if I use zero-copy memory?
Or should I just use synchronous memory copies?

Hi xnov,

Since your simulation consumes a lot of memory, it is not a good idea to allocate all of it as page-locked memory on the host. Maybe you can use pageable memory for one group of data: initialize the data on the host and copy it to the device with cudaMemcpyAsync, and while the copy is still going on you can initialize another group of data in the same pageable buffer. This works because, for pageable memory, cudaMemcpyAsync returns once the data has been copied into the driver's own memory, and the driver then copies it to the device. BTW, memory copies between host and device cannot overlap with each other in the same direction. So in my opinion, if initializing a data group is time-consuming, several host threads can be used to initialize the groups and copy them to the device simultaneously. The main thread waits for all the sub-threads to exit and synchronizes until the memory copies have finished, and then the next round of initialization and copying can start. When all the data are ready on the GPU, do the calculation and copy the result back.
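To make the idea concrete, here is a minimal single-threaded sketch of reusing one pageable buffer for several groups. The kernel `step`, the helper `init_group`, the `CHECK` macro, and the group count and size are all made up for illustration and are not from your code. It also relies on my understanding of the documented behavior that, for a pageable source, cudaMemcpyAsync returns only after the data has been staged, so the buffer can be overwritten afterwards. How much real overlap you get this way depends on the driver, so please measure it.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Hypothetical time-step kernel: shifts every element by dt.
__global__ void step(float *x, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += dt;
}

// Hypothetical host-side initialization of one group of data.
static void init_group(float *buf, int group, int n)
{
    for (int i = 0; i < n; ++i) buf[i] = (float)(group * n + i);
}

int main(void)
{
    const int kGroups = 4;          // assumed number of data groups
    const int kGroupSize = 25000;   // assumed elements per group (100K total)
    const int kTotal = kGroups * kGroupSize;
    const size_t groupBytes = kGroupSize * sizeof(float);

    // One pageable host buffer, sized for a single group and reused for all.
    float *h_buf = (float *)malloc(groupBytes);
    if (h_buf == NULL) return EXIT_FAILURE;

    float *d_x = NULL;
    CHECK(cudaMalloc(&d_x, kTotal * sizeof(float)));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    for (int g = 0; g < kGroups; ++g) {
        init_group(h_buf, g, kGroupSize);
        // Pageable source: the call returns after the data has been staged,
        // so h_buf may be overwritten in the next iteration (assumption
        // based on the documented synchronization behavior).
        CHECK(cudaMemcpyAsync(d_x + g * kGroupSize, h_buf, groupBytes,
                              cudaMemcpyHostToDevice, stream));
    }

    // All groups are queued on the same stream, so the kernel runs after them.
    const int threads = 256;
    const int blocks = (kTotal + threads - 1) / threads;
    step<<<blocks, threads, 0, stream>>>(d_x, kTotal, 0.5f);
    CHECK(cudaGetLastError());
    CHECK(cudaStreamSynchronize(stream));

    // Copy the result back group by group through the same small buffer.
    int errors = 0;
    for (int g = 0; g < kGroups; ++g) {
        CHECK(cudaMemcpy(h_buf, d_x + g * kGroupSize, groupBytes,
                         cudaMemcpyDeviceToHost));
        for (int i = 0; i < kGroupSize; ++i) {
            float expected = (float)(g * kGroupSize + i) + 0.5f;
            if (h_buf[i] != expected) ++errors;
        }
    }
    printf(errors == 0 ? "ok\n" : "mismatch\n");

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_x));
    free(h_buf);
    return errors == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

The point of this layout is that the host only ever holds one group at a time, so host memory usage stays well below a full second copy of the data. The multi-threaded variant would give each host thread its own small buffer and stream, following the same pattern.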

I am not sure whether this method is good for you. Can others give a more expert opinion?

Best regards!