CUDA Device-to-Device Copy with Host-Side Synchronization

Hi, I am doing a device-to-device cudaMemcpy. In my code I have created two host threads:
Thread 1 - Copies data from one device memory buffer to another device memory buffer.
Thread 2 - Operates on the copied data.

In the CPU program, how can I know that Thread 1 has completed the memcpy before I tell Thread 2 to start processing, so that it works on the latest data and not on stale or junk data left in the buffer?

The documentation says, “For transfers from device memory to device memory, no host-side synchronization is performed.” Can you please help me understand how to handle this situation? If you can point me to any reference code, that would be helpful.
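
To make the question concrete, here is a minimal sketch of the kind of thing I have in mind, assuming the copy is issued with cudaMemcpyAsync on its own stream and a CUDA event is recorded right after it. The names copy_stream, work_stream, copy_done, add_one and copy_then_process are made up for illustration, and everything runs in a single host thread to keep it short. I am not sure this is the recommended pattern, so corrections are welcome.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",         \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the processing done by "Thread 2".
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Copies d_src to d_dst, processes d_dst, and returns the number of
// elements that do not have the expected value (0 means success).
int copy_then_process(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(int);
    std::vector<int> host(n);
    for (int i = 0; i < n; ++i) host[i] = i;

    int *d_src = nullptr, *d_dst = nullptr;
    CUDA_CHECK(cudaMalloc(&d_src, bytes));
    CUDA_CHECK(cudaMalloc(&d_dst, bytes));
    CUDA_CHECK(cudaMemcpy(d_src, host.data(), bytes, cudaMemcpyHostToDevice));

    cudaStream_t copy_stream, work_stream;
    cudaEvent_t copy_done;
    CUDA_CHECK(cudaStreamCreate(&copy_stream));
    CUDA_CHECK(cudaStreamCreate(&work_stream));
    CUDA_CHECK(cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming));

    // "Thread 1": issue the device-to-device copy, then record an event
    // that completes only after the copy has finished.
    CUDA_CHECK(cudaMemcpyAsync(d_dst, d_src, bytes,
                               cudaMemcpyDeviceToDevice, copy_stream));
    CUDA_CHECK(cudaEventRecord(copy_done, copy_stream));

    // Option A: block the host until the copy is done.
    CUDA_CHECK(cudaEventSynchronize(copy_done));

    // Option B: make the consumer stream wait on the GPU side, so the
    // kernel cannot start before the copy completes. After Option A this
    // is redundant; it is shown only as the non-blocking alternative.
    CUDA_CHECK(cudaStreamWaitEvent(work_stream, copy_done, 0));

    // "Thread 2": process the freshly copied buffer.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    add_one<<<blocks, threads, 0, work_stream>>>(d_dst, n);
    CUDA_CHECK(cudaGetLastError());

    CUDA_CHECK(cudaMemcpyAsync(host.data(), d_dst, bytes,
                               cudaMemcpyDeviceToHost, work_stream));
    CUDA_CHECK(cudaStreamSynchronize(work_stream));

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != i + 1) ++mismatches;

    CUDA_CHECK(cudaEventDestroy(copy_done));
    CUDA_CHECK(cudaStreamDestroy(copy_stream));
    CUDA_CHECK(cudaStreamDestroy(work_stream));
    CUDA_CHECK(cudaFree(d_src));
    CUDA_CHECK(cudaFree(d_dst));
    return mismatches;
}

int main() {
    int mismatches = copy_then_process(1 << 20);
    std::printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

With two real host threads, I assume Thread 2 must not call cudaEventSynchronize or cudaStreamWaitEvent until Thread 1 has actually called cudaEventRecord, so some ordinary host-side handoff (a condition variable or similar) would still be needed between the threads. Is that the right way to think about it?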