Host to multiple device transfers

I need to transfer the same data to multiple devices at the same time. Ideally, I'd like to exploit PCI-Express multicast, but this doesn't seem to be available in CUDA at the moment.

I think the next best thing is to use portable pinned memory and transfer to each GPU from there. Is there any problem with multiple devices reading from the same portable pinned buffer at the same time? Would I see any benefit from the write-combined flag (assuming that I am going to use cudaMemcpy for the transfers)? I presume that nothing in the hardware itself (e.g. the PCI-Express bridges) is smart enough to realise that it doesn't need to request the same data from the host twice?

Alternatively, with GPUDirect I think it's possible to write directly to GPU memory from another hardware device. Is multicast possible in this case? If so, is there any way to set up a multicast transfer from some kind of virtual hardware device driver or something?
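For reference, a minimal sketch of the portable pinned memory approach, not a tested or tuned implementation. It assumes CUDA 4.0 or later (so one host thread can drive several devices via cudaSetDevice), uses cudaMemcpyAsync on one stream per device instead of the blocking cudaMemcpy so that the copies have a chance to overlap, and picks an arbitrary 64 MiB buffer size. The CHECK macro and the variable names are illustrative only.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Illustrative helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB, an arbitrary size for the sketch

    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));

    // One host buffer, page-locked and visible to every CUDA context.
    // cudaHostAllocWriteCombined could be OR-ed in here to experiment with
    // write-combined memory; whether that helps is the open question.
    void* hostBuf = nullptr;
    CHECK(cudaHostAlloc(&hostBuf, bytes, cudaHostAllocPortable));
    memset(hostBuf, 0x5A, bytes);  // placeholder payload

    std::vector<void*> devBuf(deviceCount, nullptr);
    std::vector<cudaStream_t> stream(deviceCount);

    // Per-device destination buffer and stream.
    for (int d = 0; d < deviceCount; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaMalloc(&devBuf[d], bytes));
        CHECK(cudaStreamCreate(&stream[d]));
    }

    // Enqueue every host-to-device copy before waiting on any of them,
    // so all devices read from the same pinned buffer concurrently.
    for (int d = 0; d < deviceCount; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaMemcpyAsync(devBuf[d], hostBuf, bytes,
                              cudaMemcpyHostToDevice, stream[d]));
    }

    // Wait for all copies to finish.
    for (int d = 0; d < deviceCount; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaStreamSynchronize(stream[d]));
    }

    // Release per-device resources, then the shared host buffer.
    for (int d = 0; d < deviceCount; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaStreamDestroy(stream[d]));
        CHECK(cudaFree(devBuf[d]));
    }
    CHECK(cudaFreeHost(hostBuf));

    printf("Copied %zu bytes to %d device(s)\n", bytes, deviceCount);
    return 0;
}
```

Each device still fetches its own copy of the data from host memory here, so this is N separate reads of the same buffer rather than a true multicast.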

I’m loving CUDA 4.1 RC2 by the way!