When does a GPU run end, a.k.a. when is it safe to copy data?

Since kernel launches are asynchronous, I wonder: how do I know (or ensure) that a memcpy doesn’t copy data that the kernel hasn’t had time to touch yet?

You don’t need to do anything. A cudaMemcpy to or from the host implicitly waits for all previously issued asynchronous operations to complete before it copies. A device-to-device memcpy is inserted into the queue of asynchronous operations and runs in order, after the kernels launched before it.
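
To make that concrete, here is a minimal sketch of the pattern in question. The kernel add_one and the helper kernel_then_memcpy are hypothetical names made up for this illustration, and the sketch assumes everything is issued on the default stream with the blocking cudaMemcpy (not cudaMemcpyAsync).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds 1 to every element.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns true if every element copied back was already updated by the kernel.
bool kernel_then_memcpy(int n) {
    std::vector<int> host(n, 41);
    int *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) return false;

    cudaMemcpy(dev, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_one<<<blocks, threads>>>(dev, n);  // returns to the host immediately

    // No explicit synchronization here: the blocking cudaMemcpy is issued
    // on the same default stream, so it waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (int v : host) {
        if (v != 42) return false;
    }
    return true;
}

int main() {
    printf(kernel_then_memcpy(1 << 20) ? "ok\n" : "mismatch\n");
    return 0;
}
```

If you switch to cudaMemcpyAsync or to your own non-default streams, this implicit wait no longer covers you on the host side, and you would likely want an explicit cudaStreamSynchronize or cudaDeviceSynchronize before reading the host buffer.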

Thanks! I didn’t see it stated that clearly in the manual, so I wanted to check :)