cudaMallocHost() vs cudaHostAlloc(cudaHostAllocPortable)

There seems to be quite a bit of confusion online about the meaning of cudaHostAllocPortable.
I searched the forum, but found no definitive answer.

When do I have to use cudaHostAlloc(cudaHostAllocPortable), and when is cudaMallocHost() enough?
Is this an issue of using multiple CPU threads, or is this an issue of using multiple GPU devices?

Specifically, if I am controlling multiple GPU devices from a single CPU thread
and I call cudaMallocHost(), is the memory pinned for all GPU devices,
or should I really call cudaHostAlloc(cudaHostAllocPortable)?

I suspect that this is something that evolved with the versions of CUDA.
What is the situation in CUDA 5?

I quote the "Portable Memory" section of the CUDA C Programming Guide, June 2013:
A block of page-locked memory can be used in conjunction with any device in the system (see Multi-Device System for more details on multi-device systems), but by default, the benefits of using page-locked memory described above are only available in conjunction with the device that was current when the block was allocated (and with all devices sharing the same unified address space, if any, as described in Unified Virtual Address Space). To make these advantages available to all devices, the block needs to be allocated by passing the flag cudaHostAllocPortable to cudaHostAlloc() or page-locked by passing the flag cudaHostRegisterPortable to cudaHostRegister().
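
To make the scenario concrete, here is a minimal sketch of what I mean by a single CPU thread driving several devices with one pinned buffer. It follows the quoted paragraph by passing cudaHostAllocPortable explicitly; it does not show whether plain cudaMallocHost() would behave the same. The buffer size, the CHECK macro, and the variable names are my own choices for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = 1 << 20;  // 1 MiB, arbitrary size for illustration

    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount == 0) {
        std::fprintf(stderr, "no CUDA devices found\n");
        return EXIT_FAILURE;
    }

    // Allocate page-locked host memory while device 0 is current, and ask
    // explicitly for the pinning to apply to all devices.
    CHECK(cudaSetDevice(0));
    void *hostBuf = nullptr;
    CHECK(cudaHostAlloc(&hostBuf, bytes, cudaHostAllocPortable));
    std::memset(hostBuf, 0, bytes);

    // Read back the allocation flags to see whether the portable bit is set.
    unsigned int flags = 0;
    CHECK(cudaHostGetFlags(&flags, hostBuf));
    std::printf("portable flag set: %s\n",
                (flags & cudaHostAllocPortable) ? "yes" : "no");

    // One CPU thread issues an asynchronous host-to-device copy from the
    // same pinned buffer to every device in turn.
    for (int dev = 0; dev < deviceCount; ++dev) {
        CHECK(cudaSetDevice(dev));

        void *devBuf = nullptr;
        cudaStream_t stream;
        CHECK(cudaMalloc(&devBuf, bytes));
        CHECK(cudaStreamCreate(&stream));

        CHECK(cudaMemcpyAsync(devBuf, hostBuf, bytes,
                              cudaMemcpyHostToDevice, stream));
        CHECK(cudaStreamSynchronize(stream));

        CHECK(cudaStreamDestroy(stream));
        CHECK(cudaFree(devBuf));
        std::printf("device %d: copy from pinned buffer completed\n", dev);
    }

    CHECK(cudaFreeHost(hostBuf));
    return 0;
}
```

Replacing the cudaHostAlloc() line with cudaMallocHost(&hostBuf, bytes) gives the variant I am asking about. The copies should succeed either way, since the guide says the block can be used with any device. What I cannot tell from the text is whether the page-locked benefits still apply on devices other than the one that was current at allocation time.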