cudaHostAlloc caching behavior

When the GPU performs a load or store on a memory region allocated via cudaHostAlloc (i.e., pinned, mapped host memory), it fetches the data from host memory over the PCIe bus. Is this understanding correct?
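To make the scenario concrete, here is a minimal sketch of the access pattern I mean (the kernel name, sizes, and launch configuration are just for illustration):

```cuda
#include <cuda_runtime.h>

// Kernel that loads directly from mapped host memory.
// Each miss presumably has to cross the PCIe bus.
__global__ void readMapped(const int *hostData, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = hostData[i] + 1;  // load served from host memory
}

int main() {
    const int n = 1 << 20;

    // Pinned, mapped allocation: the GPU can dereference it directly.
    int *hostData;
    cudaHostAlloc(&hostData, n * sizeof(int), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) hostData[i] = i;

    // Device-visible alias of the host allocation.
    int *devPtr;
    cudaHostGetDevicePointer(&devPtr, hostData, 0);

    int *out;
    cudaMalloc(&out, n * sizeof(int));
    readMapped<<<(n + 255) / 256, 256>>>(devPtr, out, n);
    cudaDeviceSynchronize();

    cudaFree(out);
    cudaFreeHost(hostData);
    return 0;
}
```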

If so,

  1. What is the granularity of the fetch? Is it at word level or cache-line level? That is, are neighboring load misses fetched as two separate transactions on the PCIe bus?

  2. Is the fetched data cached on the GPU? I would imagine that, for coherency reasons, this data may not be cached at all. But is there a way to give up coherency and make it cacheable on the device?

For question 2, what is the behavior with unified memory, especially when the data resides on a remote device?
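For the unified-memory case, I mean something like the following (again, names and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void touch(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;

    // Managed allocation: accessible from host and device.
    int *data;
    cudaMallocManaged(&data, n * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = i;  // pages touched first on the host

    // Question: do the kernel's accesses migrate pages to the GPU and get
    // cached there, or are they serviced remotely (and if so, cached or not)?
    touch<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```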