cudaDeviceSynchronize() doesn't wait for cudaMemcpy to finish?

My code flow looks like the following:
cudaMemcpy( dst, src, size, cudaMemcpyDeviceToDevice );
cudaDeviceSynchronize();
// ... kernel launch that reads dst ...

The above flow doesn’t work reliably: roughly 50% of calls leave incorrect values in the device pointer, causing my CUDA kernel to fail.

However, when I add a sleep between the cudaMemcpy and the kernel call, it works every time:
cudaMemcpy( dst, src, size, cudaMemcpyDeviceToDevice );
usleep(200000); // sleep for 0.2 seconds
// ... kernel launch that reads dst ...

dst and src are arrays of int32_t with 117,000,000 elements.
The GPU where this program fails is an NVIDIA Tesla P40.
The same program works fine without usleep or cudaDeviceSynchronize on an NVIDIA GeForce GTX 1080, so it seems to be device-specific, though other factors may be involved.
I did make sure the driver versions were the same on both machines.

I’m not sure why cudaDeviceSynchronize() doesn’t help but usleep() does.

Also, whenever the program fails, I see the same value “1016296637” stored in all the locations on the device side.

Is it possible that cudaDeviceSynchronize doesn’t make the CPU wait for the cudaMemcpy to finish?

Do you check for cuda runtime api errors?
Does cuda-memcheck report any errors?
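For reference, a minimal error-checking pattern looks like the sketch below (the macro name and the kernel usage shown in the comment are illustrative, not part of the original code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: checks the return value of any CUDA runtime
// call and aborts with a readable message on failure.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMemcpy(dst, src, size, cudaMemcpyDeviceToDevice));
//   kernel<<<grid, block>>>(dst);
//   CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches async execution errors
```

Checking cudaGetLastError() after the launch and the status of cudaDeviceSynchronize() is the only way to see asynchronous kernel failures; an unchecked error from an earlier call can otherwise surface much later and look like corruption.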

I used cuda-memcheck and couldn’t find any errors with memcheck, racecheck, or synccheck.
initcheck gave me this issue:
========= Host API memory access error at host access to 0x7fa4e2e274d8 of size 32792 bytes
========= Uninitialized access at 0x7fa4e2e2d500 on access by cudaMemcopy source.

You would want to fix the issues reported by cuda-memcheck right away.

Note that for device-to-device transfers, cudaMemcpy() may return before the copy is complete with respect to the host: the API synchronization behavior section of the CUDA documentation states that no host-side synchronization is performed for device-to-device copies. However, the copy and a subsequent kernel launch issued to the same (default) stream execute in order on the device, so the kernel cannot start until the copy has finished. A cudaDeviceSynchronize() between them is therefore not needed for correctness.
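As a sketch of that ordering guarantee (the kernel name and launch configuration here are illustrative, assuming n elements and 256 threads per block):

```cuda
#include <cstdint>
#include <cstddef>
#include <cuda_runtime.h>

// Placeholder kernel standing in for whatever consumes dst.
__global__ void consume(const int32_t* data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // ... use data[i] ...
    }
}

// Both operations are issued to the default stream, so the GPU runs
// them in order: the kernel does not begin executing until the
// device-to-device copy has completed on the device, even though
// cudaMemcpy may return to the host before the copy finishes.
void copy_then_launch(int32_t* dst, const int32_t* src, size_t n) {
    cudaMemcpy(dst, src, n * sizeof(int32_t), cudaMemcpyDeviceToDevice);
    consume<<<(n + 255) / 256, 256>>>(dst, n);  // no host sync needed here
}
```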