I am trying to work out the most efficient way to overcome the following problem:
I have a multi-gpu array V, held over 8 gpus (2 IOH with 4 pci each), which is structured as follows:
V.i contains an n1 vector of elements held on gpu i, so we have V.0, V.1, V.2… V.7 all n1 vectors
I’m trying to calculate the sum S = (V.0+V.1+V.2+…V.7) and have it distributed to each of the 8 gpus in the least number of steps (a step being one batch of concurrent pci transfers) as possible. This isn’t so much a coding question as a logic question, I guess, but I was wondering if anyone has had any experience with this in the past. A naiive implementation can do this in 7 steps, any better would be great! :)
Actually, with the two IO Hubs, this problem gets pretty complicated.
I assume that GPUs 0-3 are on one IOHub, and therefore can use Unified Virtual Addressing to read each other’s memory. I assume the same for GPUs 4-7. I think for this to reach full performance, I am also assuming your motherboard has PCI-Express switches joining GPUs 0-1, 2-3, 4-5, 6-7, and you have enough CPU memory bandwidth to transfer to two GPUs at full PCI-Express speed (probably true on a motherboard like this). In that situation, every step below involves PCI-Express transfers that could all run concurrently.
Step 1: Simultaneously compute
GPU 0: S.01 = V.0 + V.1 [fetch from GPU 1 in sum kernel]
GPU 2: S.23 = V.2 + V.3 [fetch from GPU 3 in sum kernel]
GPU 4: S.45 = V.4 + V.5 [fetch from GPU 5 in sum kernel]
GPU 6: S.67 = V.6 + V.7 [fetch from GPU 7 in sum kernel]
Step 2: Simultaneously compute
GPU 0: S.0123 = S.01 + S.23 [fetch from GPU 2 in sum kernel]
GPU 4: S.4567 = S.45 + S.67 [fetch from GPU 6 in sum kernel]
Step 4: Compute final sum in both IOH domains
GPU 0: S = S.0123 + S.4567 [fetch from CPU pinned memory]
GPU 4: S = S.4567 + S.0123 [fetch from CPU pinned memory]
[Note: Edited next two steps to resolve tera’s comments below]
Step 5:
GPU 2: Fetch S from GPU 0
GPU 6: Fetch S from GPU 4
Step 6:
GPU 1: Fetch S from GPU 0
GPU 3: Fetch S from GPU 2
GPU 5: Fetch S from GPU 4
GPU 7: Fetch S from GPU 6
Done! Only took 6 steps.
What was your naive solution that took 7 steps? My naive first solution took 14. :)