CUDA memcpy

If I want to copy a few values from device to device many times, what is the best approach?

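for (int i = 0; i < n; i++) {
    // copy a few bytes from s into the i-th slot of d, device to device
    cudaMemcpy(d + i, s, few_bytes, cudaMemcpyDeviceToDevice);
}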
Eventually, the entire d array is copied to the host after the loop.

Should I use asynchronous memory copies in the loop to speed this up?

For what purpose do you want to do this? What are you trying to accomplish? This has the appearance of an XY problem to me.

The key to high performance code is typically to avoid data movement as much as possible. This applies in particular to the movement of data that does not also involve some form of data processing, i.e. “pure” copies like in the question.

The source data has to be processed and offset, and the new value needs to be stored in a temporary array on the GPU as the final value. At the end of the loop, the array is copied from device to host. I can’t unroll the loop because there are too many iterations.

That is an extremely vague description, basically paraphrasing the earlier pseudo-code. Why can’t the “few bytes” be moved as part of whatever kernels are running on the GPU? Why are separate API calls from the host necessary?
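To make that suggestion concrete, here is a minimal sketch of letting the kernel itself write each result into the device array, followed by one device-to-host copy at the end. The actual processing was never described in this thread, so everything specific here is an assumption for illustration: the kernel name process_and_store, the helper run_demo, the use of float, and the "source value plus index" computation are all hypothetical stand-ins.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s\n",                  \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical processing step (assumption): each thread reads the source
// value, offsets it by its own index, and stores the result directly in
// d_out[i]. No separate device-to-device copy is issued from the host.
__global__ void process_and_store(const float* d_src, float* d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        d_out[i] = *d_src + static_cast<float>(i);
    }
}

// Hypothetical helper: runs the kernel for n elements and returns the
// results on the host.
std::vector<float> run_demo(int n, float src_value)
{
    float* d_src = nullptr;
    float* d_out = nullptr;
    CHECK(cudaMalloc(&d_src, sizeof(float)));
    CHECK(cudaMalloc(&d_out, n * sizeof(float)));
    CHECK(cudaMemcpy(d_src, &src_value, sizeof(float),
                     cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    process_and_store<<<blocks, threads>>>(d_src, d_out, n);
    CHECK(cudaGetLastError());

    // Single device-to-host copy once all results are in place. A blocking
    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<float> h_out(n);
    CHECK(cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                     cudaMemcpyDeviceToHost));

    CHECK(cudaFree(d_src));
    CHECK(cudaFree(d_out));
    return h_out;
}

int main()
{
    const int n = 1000;
    std::vector<float> h = run_demo(n, 2.0f);
    assert(static_cast<int>(h.size()) == n);
    std::printf("first = %f, last = %f\n", h[0], h[n - 1]);
    return 0;
}
```

If the real loop iterations depend on each other, this one-thread-per-element layout would not apply as written, and the kernel would need to be structured differently. The general point still holds: the store into the temporary array can happen inside the kernel that produces the value.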

How many asynchronous memory copies can run in parallel, and why? Can you please answer?