Why does All-Gather write (n-1)/n of the communication data twice?

Hi, I have a question about NCCL's All-Gather.
I used Nsight Systems to profile All-Gather with different communication data sizes (from 64M to 12G).

  • First, I collected the DRAM bandwidth utilization of each All-Gather kernel from Nsight Systems.
  • Then I calculated how much data All-Gather reads from and writes to DRAM (read data size = read bandwidth utilization * (2 TB/s) * kernel duration; the write data size is calculated the same way).
  • Finally, I found that the read data size is the same as the communication data size, but the write data size is different:
    out-of-place write data size = (2(n-1)/n + 1/n) * comm data
    in-place write data size = (2(n-1)/n) * comm data
    (n is the number of GPUs)
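
For reference, the estimate described in the steps above can be sketched as follows. This is only an illustration of the arithmetic: the function name `dram_bytes` and the sample values (25% utilization, 4 ms kernel) are made up for the example and are not measurements, and it assumes the utilization reported by Nsight Systems is an average over the kernel duration.

```python
# Peak DRAM bandwidth used in the post: 2 TB/s, expressed in bytes per second.
PEAK_DRAM_BW = 2e12


def dram_bytes(util, duration_s, peak_bw=PEAK_DRAM_BW):
    """Estimate bytes moved to or from DRAM.

    util       -- average bandwidth utilization as a fraction in [0, 1]
    duration_s -- kernel duration in seconds
    peak_bw    -- peak DRAM bandwidth in bytes per second
    """
    return util * peak_bw * duration_s


# Hypothetical sample: 25% average write utilization over a 4 ms kernel,
# which works out to approximately 2e9 bytes (about 2 GB) written.
estimated_write_bytes = dram_bytes(0.25, 4e-3)
print(estimated_write_bytes)
```

Dividing the estimated write size by the communication data size gives the factors listed above.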

I don’t understand why the (n-1)/n portion of the data is written twice in both the out-of-place and in-place cases.
As I understand it, both out-of-place and in-place All-Gather should write the (n-1)/n portion only once.
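
To make the gap concrete, here is a small sketch that tabulates the measured write factors against the factor I expected, as multiples of the communication data size. The function names are hypothetical, and the formulas are just the ones from this post, not anything taken from the NCCL source.

```python
from fractions import Fraction


def observed_write_factor(n, in_place):
    """Write size / comm data size as measured in the post for n GPUs."""
    base = Fraction(2 * (n - 1), n)
    # Out-of-place shows an additional 1/n on top of 2(n-1)/n.
    return base if in_place else base + Fraction(1, n)


def expected_write_factor(n):
    """Write size / comm data size I expected: (n-1)/n written once."""
    return Fraction(n - 1, n)


for n in (2, 4, 8):
    expected = expected_write_factor(n)
    for in_place in (False, True):
        observed = observed_write_factor(n, in_place)
        mode = "in-place" if in_place else "out-of-place"
        # The extra column is the unexplained write traffic.
        print(f"n={n} {mode:12s} observed={observed} "
              f"expected={expected} extra={observed - expected}")
```

With these formulas the unexplained extra is (n-1)/n of the communication data for in-place, and exactly 1x the communication data for out-of-place.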