CUDA sample - simpleMultiCopy result

I have a question about the results reported by the simpleMultiCopy CUDA sample.

Our developer ran simpleMultiCopy on an HPE Apollo Gen10 Plus (PCIe 4.0 x16) with a single NVIDIA A100 40GB (PCIe card).
(Windows Server 2022, CUDA 12)

The results are as follows.
However, the developer says that the "overlapped using 4 streams" value should be about double the "fully serialized execution" value.

On an HPE Apollo Gen10 (PCIe 3.0 x16) with a V100, the overlapped value is about twice the serialized one:
Fully serialized execution: 12 GB/s
Overlapped using 4 streams: 24 GB/s

I also tested it on an AMD 5600X with an RTX 3070 and confirmed that the overlapped value does not double in that setup.

I'm new to GPU programming, so I'm not sure whether the overlapped value should really be double the serialized one.
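
My rough understanding, which may be wrong, is that the ideal gain from overlapping depends on how the two copy times compare with the kernel time, so a factor of two would not be guaranteed on every system. Here is a minimal Python sketch of that simplified model. The function name `overlap_estimates` and all timings are hypothetical and are not our measurements.

```python
def overlap_estimates(h2d_ms, kernel_ms, d2h_ms):
    """Simplified per-batch time estimates (assumed model, not measured data).

    h2d_ms:    host-to-device copy time
    kernel_ms: kernel execution time
    d2h_ms:    device-to-host copy time
    """
    # Nothing overlaps: copy in, compute, copy out, one after another.
    serialized = h2d_ms + kernel_ms + d2h_ms
    # Compute overlaps with transfers, but the two copies cannot overlap each other.
    one_transfer = max(h2d_ms + d2h_ms, kernel_ms)
    # Compute overlaps with both copies, and the copies overlap each other.
    both_transfers = max(h2d_ms, kernel_ms, d2h_ms)
    return serialized, one_transfer, both_transfers


# Hypothetical timings in milliseconds, chosen only to show the trend.
cases = {
    "balanced copies and kernel": (1.0, 1.0, 1.0),
    "kernel much slower than copies": (1.0, 8.0, 1.0),
}

for label, (h2d, kernel, d2h) in cases.items():
    serialized, one_transfer, both_transfers = overlap_estimates(h2d, kernel, d2h)
    print(label)
    print("  ideal speedup, one copy overlapped:  ", serialized / one_transfer)
    print("  ideal speedup, both copies overlapped:", serialized / both_transfers)
```

In this model the ideal speedup is large when the three stages take similar time and shrinks toward 1 when one stage dominates, so I wonder if that explains the different ratios between systems.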

Please let me know your opinion.
Thank you.

