Asynchronicity of kernel execution and cuMemcpy

Can a cuMemcpy from device to host run at the same time as a kernel is executing on that device?

If so, how can the host and device communicate so that the kernel does not write to the memory the cuMemcpy is reading until the cuMemcpy has finished?

If the kernel execution and the memcpy are in the same stream (if you don’t specify a stream, they are implicitly in stream 0), then you don’t need to worry. Work in a single stream runs in issue order, so the copy does not start until the kernel has finished; cuMemcpy automatically performs the synchronization required to ensure consistency.
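A minimal sketch of the same-stream case. It uses the runtime API (cudaMemcpy) rather than the driver API's cuMemcpy, only because that keeps the example self-contained; that swap is my assumption, as are the kernel name fill and the buffer size. The ordering behaviour in the default stream is the same.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: each thread writes its own index.
__global__ void fill(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main() {
    const int n = 1 << 20;  // arbitrary size chosen for the example
    int *h_buf = new int[n];
    int *d_buf = nullptr;

    if (cudaMalloc(&d_buf, n * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        delete[] h_buf;
        return 1;
    }

    // The launch returns to the host immediately; no stream is given,
    // so the kernel goes into the default stream (stream 0).
    fill<<<(n + 255) / 256, 256>>>(d_buf, n);

    // Also in the default stream: the copy is ordered after the kernel
    // and blocks the host until the data has arrived.
    cudaError_t err = cudaMemcpy(h_buf, d_buf, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_buf);
        delete[] h_buf;
        return 1;
    }

    // Safe to read h_buf here without any extra synchronization call.
    bool ok = (h_buf[0] == 0) && (h_buf[n - 1] == n - 1);
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(d_buf);
    delete[] h_buf;
    return ok ? 0 : 1;
}
```

If the kernel and the copy were issued into two different non-default streams, this implicit ordering would no longer apply, and you would need explicit synchronization (for example an event or a stream synchronize) before the copy.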

Streams are the next thing for me to understand in CUDA.