Suppose I have several GPUs connected, each serviced by its own host thread and having its own private GPU address space.
To broadcast a big buffer to all the GPUs, I can:
0- Dream of using a broadcast API from CUDA, which is still missing…
1- Ask each service thread to upload the buffer to its GPU's private space. Since they are separate threads, all the memcpy calls will run in parallel.
2- Synchronize the service threads after each upload, hence doing the memcpy calls one after the other.
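To make option 1 concrete, here is a minimal sketch, assuming the CUDA runtime API with one host thread bound to each device; the function names (`upload_to_gpu`, `broadcast`) are mine, and error checking is trimmed for brevity:

```cuda
// Option 1: one host thread per GPU, each doing its own host-to-device copy.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

void upload_to_gpu(int device, const float* host_buf, size_t n, float** dev_buf_out) {
    cudaSetDevice(device);  // bind this thread to its GPU
    cudaMalloc((void**)dev_buf_out, n * sizeof(float));
    // With page-locked host memory (cudaMallocHost) this copy can be a true DMA;
    // with pageable memory the driver stages it through an internal buffer,
    // which may serialize the "concurrent" copies anyway.
    cudaMemcpy(*dev_buf_out, host_buf, n * sizeof(float), cudaMemcpyHostToDevice);
}

void broadcast(const float* host_buf, size_t n, int num_gpus) {
    std::vector<float*> dev_bufs(num_gpus, nullptr);
    std::vector<std::thread> threads;
    for (int d = 0; d < num_gpus; ++d)
        threads.emplace_back(upload_to_gpu, d, host_buf, n, &dev_bufs[d]);
    for (auto& t : threads) t.join();  // joining inside the for loop above would give option 2
}
```

Whether the copies actually overlap then depends on the hardware topology (how many PCIe links the GPUs share) and on whether the host buffer is pinned.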
Will 0 (broadcast) be supported in some future release of CUDA?
Is 1 (concurrent DMA) safe? Or should I expect crashes…
Is 1 really faster than 2? Or is the bus the bottleneck anyway, so that there is no gain in running several DMA transfers concurrently?
Granted, but if “several” is 12 (e.g. 3 Tesla S870 units connected through 3 PCIe x16 links), I do have several buses that should be able to sustain concurrent DMAs.
So the real question is: when is 1 faster than 2?