Multiple memcpy HostToDevice in parallel? Or how to fake a broadcast to several GPUs

Suppose I have several GPUs connected, each serviced by its own host thread and each with its own private GPU address space.

To broadcast a big buffer to the GPUs, I can:

0- Dream of using a broadcast API from CUDA, which is still missing…

1- Ask each service thread to upload the buffer to its own private space. Because they are separate threads, all the memcpy calls will run in parallel (see the sketch after this list).

2- Synchronize the service threads after each upload, hence doing the memcpy calls one after the other.
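
To make 1 concrete, here is roughly what I have in mind for each service thread (a minimal sketch with the CUDA runtime API; the worker name `upload_worker` is mine, and error checking is omitted):

```cpp
// Option 1: one host thread per GPU, each binding to its device and issuing
// its own HostToDevice memcpy. upload_worker is a made-up name for the sketch.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

static void upload_worker(int dev, const void* host_buf, size_t bytes) {
    cudaSetDevice(dev);            // bind this thread to its GPU
    void* dev_buf = nullptr;
    cudaMalloc(&dev_buf, bytes);   // private copy in this GPU's address space
    // Each thread issues its own copy; whether the DMAs really overlap
    // depends on how the GPUs share the PCIe links.
    cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice);
    cudaFree(dev_buf);             // kept only so the sketch is self-contained
}

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    std::vector<char> host_buf(256u << 20);   // 256 MB test buffer

    std::vector<std::thread> workers;
    for (int dev = 0; dev < n; ++dev)         // option 1: all uploads in flight
        workers.emplace_back(upload_worker, dev,
                             host_buf.data(), host_buf.size());
    for (auto& t : workers) t.join();         // option 2 would instead launch
                                              // and join one thread at a time
    return 0;
}
```

As far as I understand, with pinned host memory (`cudaMallocHost`) the copies become true DMA transfers, while with pageable memory the driver stages them through an internal pinned buffer, which already serializes part of the work.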

Questions:

  • Will 0 (broadcast) be supported in some future release of CUDA?

  • Is 1 (concurrent DMA) safe? Or should I expect some crashes…

  • Is 1 actually faster than 2? The bus is the bottleneck anyway, so there may be no gain in sending several concurrent DMA transfers.

  • Yes, but if “several” is 12 (e.g. 3 S870 units connected through 3 PCIe x16 slots), I do have several buses that should be able to do concurrent DMAs.
    So the real question is: when is 1 faster than 2?
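
To answer that last question empirically, a rough timing harness like the one below could compare the two variants directly: run all uploads concurrently (1), then one after the other (2), and compare wall-clock times. Pinned host memory via `cudaMallocHost` is assumed so the copies are real DMA transfers; error checking is again omitted.

```cpp
// Rough benchmark: concurrent uploads (option 1) vs serialized (option 2).
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    const size_t bytes = 256u << 20;
    void* src = nullptr;
    cudaMallocHost(&src, bytes);    // pinned memory => DMA-capable source

    auto upload = [&](int dev) {
        cudaSetDevice(dev);
        void* dst = nullptr;
        cudaMalloc(&dst, bytes);
        cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
        cudaFree(dst);
    };

    using clk = std::chrono::steady_clock;

    auto t0 = clk::now();                       // option 1: all at once
    std::vector<std::thread> ts;
    for (int d = 0; d < n; ++d) ts.emplace_back(upload, d);
    for (auto& t : ts) t.join();

    auto t1 = clk::now();                       // option 2: one after the other
    for (int d = 0; d < n; ++d) upload(d);
    auto t2 = clk::now();

    auto ms = [](clk::time_point a, clk::time_point b) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    std::printf("concurrent: %lld ms, serialized: %lld ms\n",
                (long long)ms(t0, t1), (long long)ms(t1, t2));
    cudaFreeHost(src);
    return 0;
}
```

If the 12 GPUs really sit behind 3 independent PCIe x16 links, I would expect the concurrent time to approach a third of the serialized one; if everything funnels through a single link, the two times should come out roughly equal.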