Multi-GPU reduction + broadcast of an array

I am trying to work out the most efficient way to overcome the following problem:

I have a multi-GPU array V, held across 8 GPUs (2 I/O hubs with 4 PCIe GPUs each), which is structured as follows:
V.i is an n×1 vector of elements held on GPU i, so V.0, V.1, V.2… V.7 are all n×1 vectors.
I’m trying to calculate the sum S = (V.0 + V.1 + V.2 + … + V.7) and have it distributed to each of the 8 GPUs in the fewest steps possible, a step being one batch of concurrent PCIe transfers. This isn’t so much a coding question as a logic question, I guess, but I was wondering if anyone has experience with this. A naive implementation can do it in 7 steps; anything better would be great! :)
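For concreteness, here’s a minimal CUDA sketch of the layout (buffer names like d_V are just for illustration):

```
#include <cuda_runtime.h>

#define NUM_GPUS 8

float* d_V[NUM_GPUS];  // d_V[i] lives in GPU i's memory and holds V.i

void allocateVectors(size_t n) {
    for (int dev = 0; dev < NUM_GPUS; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc(&d_V[dev], n * sizeof(float));
    }
}
```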

Actually, with the two I/O hubs, this problem gets pretty complicated.

I assume that GPUs 0-3 are on one I/O hub and can therefore use Unified Virtual Addressing to read each other’s memory, and the same for GPUs 4-7. For this to reach full performance, I am also assuming your motherboard has PCI-Express switches joining GPU pairs 0-1, 2-3, 4-5, and 6-7, and that you have enough CPU memory bandwidth to feed two GPUs at full PCI-Express speed (probably true on a motherboard like this). Under those assumptions, the PCI-Express transfers within each step below can all run concurrently.
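As a sketch, the peer-access setup within each hub domain would look something like this (assuming cudaDeviceCanAccessPeer reports support for pairs on the same hub):

```
// Enable peer access within each I/O hub domain (GPUs 0-3 and 4-7).
void enablePeerAccessWithinDomains() {
    for (int dev = 0; dev < 8; ++dev) {
        int first = (dev < 4) ? 0 : 4;   // first GPU in dev's domain
        cudaSetDevice(dev);
        for (int peer = first; peer < first + 4; ++peer) {
            if (peer == dev) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, dev, peer);
            if (canAccess)
                cudaDeviceEnablePeerAccess(peer, 0);  // flags must be 0
        }
    }
}
```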

Step 1: Simultaneously compute
GPU 0: S.01 = V.0 + V.1 [fetch from GPU 1 in sum kernel]
GPU 2: S.23 = V.2 + V.3 [fetch from GPU 3 in sum kernel]
GPU 4: S.45 = V.4 + V.5 [fetch from GPU 5 in sum kernel]
GPU 6: S.67 = V.6 + V.7 [fetch from GPU 7 in sum kernel]
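Here is a minimal sketch of such a “fetch from peer” sum kernel; once peer access is enabled, b can point into another GPU’s memory and the reads travel over PCI-Express (d_V as in the sketch above, d_S01 made up for illustration):

```
__global__ void sumKernel(float* out, const float* a, const float* b, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] + b[i];  // remote reads of b[] cross PCI-Express
}

// Step 1 on GPU 0 (GPUs 2, 4 and 6 run the same pattern concurrently):
void step1_gpu0(float* d_S01, size_t n) {
    cudaSetDevice(0);
    sumKernel<<<(unsigned)((n + 255) / 256), 256>>>(d_S01, d_V[0], d_V[1], n);
}
```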

Step 2: Simultaneously compute
GPU 0: S.0123 = S.01 + S.23 [fetch from GPU 2 in sum kernel]
GPU 4: S.4567 = S.45 + S.67 [fetch from GPU 6 in sum kernel]
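Step 2 reuses the same kernel. Kernel launches are asynchronous with respect to the host, so the two partial reductions overlap (again with made-up buffer names):

```
void step2(float* d_S0123, float* d_S01, float* d_S23,
           float* d_S4567, float* d_S45, float* d_S67, size_t n) {
    unsigned blocks = (unsigned)((n + 255) / 256);
    cudaSetDevice(0);
    sumKernel<<<blocks, 256>>>(d_S0123, d_S01, d_S23, n);  // S.23 pulled from GPU 2
    cudaSetDevice(4);
    sumKernel<<<blocks, 256>>>(d_S4567, d_S45, d_S67, n);  // S.67 pulled from GPU 6
}
```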

Step 3: Device-to-host copies
GPU 0 → CPU: S.0123
GPU 4 → CPU: S.4567
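A sketch of Step 3, using pinned (and mapped) host buffers so that both copies can run concurrently and Step 4 can read them directly from the GPUs:

```
void step3(float* d_S0123, float* d_S4567,
           float** h_S0123, float** h_S4567, size_t n) {
    cudaHostAlloc((void**)h_S0123, n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)h_S4567, n * sizeof(float), cudaHostAllocMapped);

    cudaSetDevice(0);
    cudaMemcpyAsync(*h_S0123, d_S0123, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaSetDevice(4);
    cudaMemcpyAsync(*h_S4567, d_S4567, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Both copies must land before Step 4 reads the host buffers.
    cudaSetDevice(0); cudaDeviceSynchronize();
    cudaSetDevice(4); cudaDeviceSynchronize();
}
```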

Step 4: Compute final sum in both IOH domains
GPU 0: S = S.0123 + S.4567 [fetch from CPU pinned memory]
GPU 4: S = S.4567 + S.0123 [fetch from CPU pinned memory]
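On a 64-bit system with UVA, the pinned host pointers from cudaHostAlloc can be passed straight to the kernel, so Step 4 is just two more sumKernel launches with zero-copy reads of the other half’s partial sum (d_S[i] being a hypothetical buffer for S on GPU i):

```
void step4(float* d_S[], float* d_S0123, float* d_S4567,
           float* h_S0123, float* h_S4567, size_t n) {
    unsigned blocks = (unsigned)((n + 255) / 256);
    cudaSetDevice(0);
    sumKernel<<<blocks, 256>>>(d_S[0], d_S0123, h_S4567, n);  // S = S.0123 + S.4567
    cudaSetDevice(4);
    sumKernel<<<blocks, 256>>>(d_S[4], d_S4567, h_S0123, n);  // S = S.4567 + S.0123
}
```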

[Note: Edited next two steps to resolve tera’s comments below]

Step 5:
GPU 2: Fetch S from GPU 0
GPU 6: Fetch S from GPU 4

Step 6:
GPU 1: Fetch S from GPU 0
GPU 3: Fetch S from GPU 2
GPU 5: Fetch S from GPU 4
GPU 7: Fetch S from GPU 6
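Steps 5 and 6 map naturally onto cudaMemcpyPeerAsync. Each Step 6 transfer uses a disjoint source/destination pair, so all four can proceed in parallel; a sketch, reusing the hypothetical d_S[i] buffers from Step 4:

```
void step5and6(float* d_S[], size_t n) {
    const size_t bytes = n * sizeof(float);

    // Step 5 (both copies stay within one hub domain)
    cudaMemcpyPeerAsync(d_S[2], 2, d_S[0], 0, bytes);
    cudaMemcpyPeerAsync(d_S[6], 6, d_S[4], 4, bytes);
    cudaDeviceSynchronize();  // both copies were issued on the current device's stream

    // Step 6 (four disjoint pairs, all in parallel)
    cudaMemcpyPeerAsync(d_S[1], 1, d_S[0], 0, bytes);
    cudaMemcpyPeerAsync(d_S[3], 3, d_S[2], 2, bytes);
    cudaMemcpyPeerAsync(d_S[5], 5, d_S[4], 4, bytes);
    cudaMemcpyPeerAsync(d_S[7], 7, d_S[6], 6, bytes);
    cudaDeviceSynchronize();
}
```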

Done! Only took 6 steps.

What was your naive solution that took 7 steps? My naive first solution took 14. :)

This looks like the optimal solution to me too. However, aren’t the GPUs permuted in steps 5 and 6? I would expect them to be:

Step 5:
GPU 2: Fetch S from GPU 0
GPU 6: Fetch S from GPU 4

Step 6:
GPU 1: Fetch S from GPU 0
GPU 3: Fetch S from GPU 2
GPU 5: Fetch S from GPU 4
GPU 7: Fetch S from GPU 6

in order for the transfers in the last step to be able to run in parallel.

Ah, right. I did mess that up at the end. I’ve fixed the solution above.