Implicit Synchronization

I have some questions about the following paragraph from the Programming Guide:

"Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:

a page-locked host memory allocation,
a device memory allocation,
a device memory set,
a memory copy between two addresses to the same device memory,
any CUDA command to the NULL stream,
a switch between the L1/shared memory configurations described in Compute Capability 3.x and Compute Capability 7.x."

  1. What exactly is the behavior when “Two commands from different streams cannot run concurrently”?

Are they serialized with the in-between command?

If so, do they also become synchronous with respect to host?

  2. What exactly does “a memory copy between two addresses to the same device memory” mean? “Device memory” the same as what?
  1. Two commands (CUDA operations) that cannot run concurrently are serialized: rather than operation A and operation B executing at the same time, one will execute, then the other. That does not necessarily mean they become synchronous with respect to the host. For example, a cudaMalloc operation (“a device memory allocation”) issued between 2 kernel calls will prevent those 2 kernels from executing concurrently, even if the kernel calls are issued to separate streams.
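A minimal sketch of that first case (the kernel, its launch configuration, and the buffer sizes are all illustrative, not from the original post):

```cuda
#include <cuda_runtime.h>

__global__ void work(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaMalloc(&a, n * sizeof(float));

    // Kernel issued to stream s1.
    work<<<(n + 255) / 256, 256, 0, s1>>>(a, n);

    // Device memory allocation issued by the host thread in between:
    // an implicit synchronization point, so the kernel below cannot
    // overlap with the kernel above even though they are in
    // different streams.
    cudaMalloc(&b, n * sizeof(float));

    // Kernel issued to stream s2; it will not run concurrently with
    // the first kernel.
    work<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaFree(a);
    cudaFree(b);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```

Moving both cudaMalloc calls before the first kernel launch removes the in-between operation and allows the two kernels to overlap (hardware permitting).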

  2. This is referring to a memory copy between two device memory addresses on the same device, i.e. the same GPU: for example, a cudaMemcpy operation where the direction token is cudaMemcpyDeviceToDevice and both supplied pointers (source and destination) refer to locations on the same device.

Thanks for the clarification!