A100 simpleMultiCopy

I used simpleMultiCopy to test A100 performance and it came out like this:

Fully serialized execution : 25 GB/s
Overlapped using 4 streams : 29 GB/s

I think the overlap result is low. What should I check?

If anyone has tested it, can you let me know the results?

Why do you think the overlap result is low? What values did you expect for your machine?

50 GB/s?
The customer said that with a V100 it came out as 12/24 GB/s.
But the A100 came out too low. (The server with the V100 is a different model from the one with the A100; the A100 server is more up to date.)
Actually I don't know GPUs very well. I asked here because I couldn't find anything even when I Googled it.

I am not familiar with the program you are using, and have used neither a V100 nor an A100. Is this a controlled experiment, i.e. one and the same physical system and software stack, with a single GPU, and this single GPU is either a V100 or an A100?

The result from the V100 looks like it's limited purely by the PCIe gen3 x16 interface of the GPU, so 12 GB/sec for uni-directional traffic and 24 GB/sec for bi-directional traffic.

The A100 has a PCIe gen4 interface, so 25 GB/sec for uni-directional traffic is expected, and since PCIe supports full-duplex operation, we would expect to see that doubled for bi-directional traffic, which is not observed here.
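
For the arithmetic behind those figures (rough estimates on my part): PCIe gen3 signals at 8 GT/s per lane, so an x16 link carries 8 GT/s x 16 lanes x 128/130 encoding / 8 bits ≈ 15.8 GB/sec raw per direction, of which roughly 12-13 GB/sec is typically achievable once packet and protocol overhead is accounted for. PCIe gen4 doubles the signaling rate to 16 GT/s per lane, i.e. ≈ 31.5 GB/sec raw and roughly 25-26 GB/sec achievable per direction.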

Assuming it is not an issue with the design of the test app (the app may not be suitable for benchmarking!) it looks like there is some other kind of bottleneck. What does the host system look like? One or two CPU sockets, what kind of CPU? How many channels of DDR4 or DDR5 and what speed grade are used for system memory?

The application in question is here. In a nutshell, it does this process:

  1. H->D copy from pinned to device memory
  2. run a kernel
  3. D->H copy from device memory to pinned

In the first phase of testing it runs that 1-2-3 operation 10 times, serially (i.e. all steps in the same stream.)

In the second phase of testing it runs that operation 10 times, but using 4 streams. It then computes the speedup. (I don’t know why somebody thought it would be a good idea to put 10 operations in 4 streams.)

The code was written a long time ago, so one of the deficiencies is that the kernel execution time on modern GPUs is quite short (e.g. ~100us vs. potentially millisecond or longer copy times). But we can leave that aside for the moment.

Even if we discount the kernel, the operation should be able to achieve almost 2x speedup from phase 1 to phase 2, because of overlap of step 1 in a particular stream with step 3 in another stream.
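
For illustration, here is a minimal sketch of that overlap pattern (my own stripped-down version, not the actual simpleMultiCopy source; the kernel, buffer sizes, and names are placeholders). Each stream performs its own H->D copy, kernel, and D->H copy; because the copies are asynchronous and the streams are independent, the D->H copy of one stream can overlap the H->D copy of another on a GPU with separate copy engines:

// Minimal sketch (not the actual simpleMultiCopy source): H->D copy, kernel,
// D->H copy per stream, across 4 streams. Names and sizes are illustrative.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

__global__ void incKernel(int *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1;
}

int main() {
    const int nStreams = 4;
    const int n = 4 * 1024 * 1024;                 // elements per stream
    const size_t bytes = n * sizeof(int);

    int *h_in[nStreams], *h_out[nStreams], *d_buf[nStreams];
    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i) {
        cudaMallocHost((void **)&h_in[i], bytes);  // pinned host memory
        cudaMallocHost((void **)&h_out[i], bytes);
        memset(h_in[i], 0, bytes);
        cudaMalloc((void **)&d_buf[i], bytes);
        cudaStreamCreate(&streams[i]);
    }

    // Overlapped phase: each stream gets its own copy-kernel-copy sequence.
    // The D->H copy in one stream can run concurrently with the H->D copy
    // in another stream (separate copy engines), which is where the near-2x
    // speedup over fully serialized execution comes from.
    for (int i = 0; i < nStreams; ++i) {
        cudaMemcpyAsync(d_buf[i], h_in[i], bytes, cudaMemcpyHostToDevice, streams[i]);
        incKernel<<<(n + 255) / 256, 256, 0, streams[i]>>>(d_buf[i], n);
        cudaMemcpyAsync(h_out[i], d_buf[i], bytes, cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();
    printf("h_out[0][0] = %d\n", h_out[0][0]);     // expect 1

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamDestroy(streams[i]);
        cudaFree(d_buf[i]);
        cudaFreeHost(h_in[i]);
        cudaFreeHost(h_out[i]);
    }
    return 0;
}

Issuing one copy-kernel-copy sequence per stream like this is just to keep the sketch short; the timing and 10-repetition logic of the real sample is omitted.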

If you’re not witnessing this, a profiling step may be in order. However one thing that can cause this behavior (much less than 2x speedup observed) is severely limited host memory bandwidth. When we get to phase 2, the host memory bandwidth must be sufficient to support two transfer activities at the same time, in order to witness the speedup. If you have less than that, it will show up as increased duration of the transfers in the second, overlapped phase. Here is what that looks like:

This is on an R730 system that has a single L4 GPU, and its host memory system is literally a single 32GB DIMM. So the system is horribly misconfigured for high-performance GPU usage. In the above diagram, I have hovered the mouse over the first transfer in phase 1 so that you can see the transfer specifics. The first transfer in phase 2 is basically twice as long.
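
To put a rough number on it (assuming that lone DIMM runs at, say, DDR4-2400): a single DRAM channel tops out around 2400 MT/s x 8 bytes ≈ 19 GB/sec theoretical. An H->D copy has to be sourced from that DRAM and a D->H copy has to be sunk into it, so as soon as two transfers run concurrently they are sharing well under 19 GB/sec of host memory bandwidth, and each transfer stretches out accordingly.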

So I cannot say what is going on in the system in question. But these are some considerations when trying to interpret the results of that sample code. To go to the next step, profiling is important to confirm the basic behavior (i.e. proximal root cause - the transfers may be taking twice as long - as you see in my example), and then based on the profiler output, it may be necessary to investigate the root cause for example based on some of the info requested by njuffa.
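
(If it helps: Nsight Systems is one way to capture a timeline like the one above, e.g. running "nsys profile ./simpleMultiCopy" and opening the report, and then reading off the individual transfer durations.)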


For reference, here is what the program output looks like on my “lousy” L4 system:

# ./simpleMultiCopy
[simpleMultiCopy] - Starting...
> Using CUDA device [0]: NVIDIA L4
[NVIDIA L4] has 58 MP(s) x 128 (Cores/MP) = 7424 (Cores)
> Device name: NVIDIA L4
> CUDA Capability 8.9 hardware with 58 multi-processors
> scale_factor = 1.00
> array_size   = 4194304


Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
    (Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000)

Measured timings (throughput):
 Memcpy host to device  : 2.008512 ms (8.353057 GB/s)
 Memcpy device to host  : 2.512256 ms (6.678148 GB/s)
 Kernel                 : 0.056736 ms (2957.067124 GB/s)

Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 4.577504 ms
Compute can overlap with one transfer: 4.520768 ms
Compute can overlap with both data transfers: 2.512256 ms

Average measured timings over 10 repetitions:
 Avg. time when execution fully serialized      : 4.565504 ms
 Avg. time when overlapped using 4 streams      : 4.480093 ms
 Avg. speedup gained (serialized - overlapped)  : 0.085411 ms

Measured throughput:
 Fully serialized execution             : 7.349557 GB/s
 Overlapped using 4 streams             : 7.489674 GB/s

Here is what the output looks like on another system that has “sufficient” host memory bandwidth:

$ ./simpleMultiCopy
[simpleMultiCopy] - Starting...
> Using CUDA device [0]: NVIDIA GeForce GTX 970
[NVIDIA GeForce GTX 970] has 13 MP(s) x 128 (Cores/MP) = 1664 (Cores)
> Device name: NVIDIA GeForce GTX 970
> CUDA Capability 5.2 hardware with 13 multi-processors
> scale_factor = 1.00
> array_size   = 4194304


Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
    (Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000)

Measured timings (throughput):
 Memcpy host to device  : 2.708384 ms (6.194548 GB/s)
 Memcpy device to host  : 2.516736 ms (6.666260 GB/s)
 Kernel                 : 0.370432 ms (452.909481 GB/s)

Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 5.595552 ms
Compute can overlap with one transfer: 5.225120 ms
Compute can overlap with both data transfers: 2.708384 ms

Average measured timings over 10 repetitions:
 Avg. time when execution fully serialized      : 5.595418 ms
 Avg. time when overlapped using 4 streams      : 3.240377 ms
 Avg. speedup gained (serialized - overlapped)  : 2.355040 ms

Measured throughput:
 Fully serialized execution             : 5.996770 GB/s
 Overlapped using 4 streams             : 10.355100 GB/s
$

The system configuration is as follows.

OS and IDE : Windows Server 2022, Visual Studio 2022
GPU : NVIDIA A100 40GB PCIe x 4
CPU : AMD 7402 x 2
Memory : DDR4-3200 64GB x 16, supports 8 channels

I'm checking with the server vendor whether the GPU configuration or a BIOS setting is wrong.

Thank you for letting me know the results of your test.
There's something called "Nsight".
I think this is getting a little harder.

I’ll wait for the server vendor’s response and find out what I can do.

There are a few other things to check:

  • On Windows, I don't expect that A100 GPUs could or should be in WDDM mode, but if it were my system I would reconfirm that they are in TCC mode with nvidia-smi (a query example follows this list).
  • On a multi-CPU-socket system of undetermined design, it is probably sensible to run this test with appropriate process pinning. I have an idea how to do that on Linux. It can be done on Windows, but I won't be able to give you a recipe.
  • Some servers have settings that determine how system memory is mapped across sockets as well as across DRAM channels. This is potentially important to "get right". I wouldn't be able to give a recipe without knowing the exact SBIOS options that control it, but again I would want no cross-socket striping, and within a socket I would want striping across DRAM channels.
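
For the first bullet, one way to check (to the best of my knowledge) is to run "nvidia-smi -q" on the Windows box and look at the "Driver Model" section it prints per GPU, which shows the current and pending model (TCC or WDDM).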

@Robert_Crovella already provided a bunch of helpful tips. In systems with two CPU sockets it is important that each GPU "talks" to the "near" CPU and its attached "near" memory, as transfers across the inter-CPU fabric may suffer from bandwidth limitations in the interconnect and/or latency issues that ultimately reduce effective bandwidth of the interconnect.

As a quick sanity check: How many of the eight DDR4 channels per CPU socket are actually populated in this system? Ideally it would be all of them, and I interpret "DDR4 3200 64GB x 16" to mean that this is in fact the case.

In which case each socket should have about 160 GB/sec of practically achievable DRAM bandwidth available to it, more than enough to simultaneously source and sink a PCIe gen4 x16 link saturated with bi-directional traffic.
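
As a back-of-the-envelope check on that figure: 8 channels x 3200 MT/s x 8 bytes = 204.8 GB/sec theoretical peak per socket, and at a typical 75-80% efficiency that lands right around the ~160 GB/sec of practically achievable bandwidth mentioned above.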

Here is the result running on a DGX-A100 system:

# ./t1
[simpleMultiCopy] - Starting...
cuda_device = 0
> Using CUDA device [0]: NVIDIA A100-SXM4-80GB
[NVIDIA A100-SXM4-80GB] has 108 MP(s)
> Device name: NVIDIA A100-SXM4-80GB
> CUDA Capability 8.0 hardware with 108 multi-processors
> scale_factor = 1.00
> array_size   = 4194304


Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
    (Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000)

Measured timings (throughput):
 Memcpy host to device  : 0.661504 ms (25.362230 GB/s)
 Memcpy device to host  : 0.774528 ms (21.661212 GB/s)
 Kernel                 : 0.052864 ms (3173.656162 GB/s)

Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 1.488896 ms 
Compute can overlap with one transfer: 1.436032 ms
Compute can overlap with both data transfers: 0.774528 ms

Average measured timings over 10 repetitions:
 Avg. time when execution fully serialized      : 1.481747 ms
 Avg. time when overlapped using 4 streams      : 0.890531 ms
 Avg. speedup gained (serialized - overlapped)  : 0.591216 ms

Measured throughput:
 Fully serialized execution             : 22.645179 GB/s
 Overlapped using 4 streams             : 37.679122 GB/s
#

Here is my result running on an H100. The overlap result is also low.

[simpleMultiCopy] - Starting...
> Using CUDA device [0]: NVIDIA H100 80GB HBM3
[NVIDIA H100 80GB HBM3] has 132 MP(s) x 128 (Cores/MP) = 16896 (Cores)
> Device name: NVIDIA H100 80GB HBM3
> CUDA Capability 9.0 hardware with 132 multi-processors
> scale_factor = 1.00
> array_size   = 8388608


Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
    (Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000)

Measured timings (throughput):
 Memcpy host to device	: 0.592192 ms (56.661408 GB/s)
 Memcpy device to host	: 0.615040 ms (54.556503 GB/s)
 Kernel			: 0.055648 ms (6029.764335 GB/s)

Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 1.262880 ms
Compute can overlap with one transfer: 1.207232 ms
Compute can overlap with both data transfers: 0.615040 ms

Average measured timings over 10 repetitions:
 Avg. time when execution fully serialized	: 1.256294 ms
 Avg. time when overlapped using 4 streams	: 0.914890 ms
 Avg. speedup gained (serialized - overlapped)	: 0.341405 ms

Measured throughput:
 Fully serialized execution		: 53.418102 GB/s
 Overlapped using 4 streams		: 73.351869 GB/s

@Robert_Crovella mentioned that this demo app is quite old. It seems to me that this code is no longer a good demo vehicle for stream overlap, and NVIDIA should either rework the app or withdraw it.

For the purpose of just measuring bi-directional PCIe bandwidth a simple test using just two streams would probably be best.
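
For what it's worth, a minimal sketch of such a two-stream test might look like the following (my own illustration, not an existing sample; the buffer size, names, and host-timer approach are arbitrary choices): one stream issues an H->D copy while the other issues a D->H copy, both against pinned host memory, and the combined bytes moved are divided by the wall-clock time.

// Sketch: measure bi-directional H<->D bandwidth with two streams.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    const size_t bytes = 256ull * 1024 * 1024;      // 256 MiB per direction
    char *h_src, *h_dst, *d_a, *d_b;
    cudaMallocHost((void **)&h_src, bytes);         // pinned host buffers
    cudaMallocHost((void **)&h_dst, bytes);
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Warm-up so one-time setup costs are not timed
    cudaMemcpyAsync(d_a, h_src, bytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(h_dst, d_b, bytes, cudaMemcpyDeviceToHost, s2);
    cudaDeviceSynchronize();

    // Timed run: both directions in flight at the same time
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpyAsync(d_a, h_src, bytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(h_dst, d_b, bytes, cudaMemcpyDeviceToHost, s2);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    printf("bi-directional aggregate: %.2f GB/s (%.3f ms)\n",
           2.0 * bytes / sec / 1.0e9, sec * 1.0e3);

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(d_a); cudaFree(d_b);
    cudaFreeHost(h_src); cudaFreeHost(h_dst);
    return 0;
}

A test like this keeps the measurement focused on the PCIe link and the host memory system, without the kernel and repetition-count quirks of the old sample.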
