Transfer-rate fluctuations when testing the GPU with the bandwidthTest tool in the htod direction

We compiled the bandwidthTest tool from the latest official cuda_sample_test to test our hardware. The PCIe hardware is functioning normally. After multiple rounds of long-duration testing, we concluded that there are no fluctuations in the d2h (device-to-host) direction. However, the h2d (host-to-device) tests show fluctuations like those in the data below.

Some of the test data looks like this:

2025-02-21 04:08:20 , 26.6
2025-02-21 04:08:20 , 26.6
2025-02-21 04:08:21 , 26.7
2025-02-21 04:08:21 , 26.6
2025-02-21 04:08:22 , 25.0//fluctuations
2025-02-21 04:08:22 , 26.7
2025-02-21 04:08:22 , 26.7
2025-02-21 04:08:23 , 26.7
2025-02-21 04:08:23 , 26.7
2025-02-21 04:08:24 , 26.7
2025-02-21 04:08:24 , 26.6
2025-02-21 04:08:24 , 26.7
2025-02-21 04:08:25 , 26.7

2025-02-21 06:01:08 , 26.6
2025-02-21 06:01:08 , 26.6
2025-02-21 06:01:08 , 26.6
2025-02-21 06:01:09 , 26.7
2025-02-21 06:01:09 , 26.6
2025-02-21 06:01:10 , 25.7//fluctuations
2025-02-21 06:01:10 , 26.7
2025-02-21 06:01:11 , 26.6
2025-02-21 06:01:11 , 26.7
2025-02-21 06:01:11 , 26.6

Testing Command:

Our testing command has been simplified as follows:

gpu_i=0 && while true; do 
  ./bandwidthTest --device=${gpu_i} --htod --csv | grep H2D | awk -F ',' '{print $2}' | awk '{print $3}' | \
  awk '{now=strftime("%F %T , ");sub(/^/, now);print}' | \
  tee -a ../log/gpu_0_pinned_h2d_01.log
done

Environment Information:

Kernel version:

Linux pilot 5.15.0-97-generic #107-Ubuntu SMP Wed Feb 7 13:26:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

GPU model:

NVIDIA RTX 4080 Super

• Both CPU and GPU temperatures are normal.

• Cooling systems are functioning properly.

Attempted Solutions (Without Success):

  1. Set CPU power mode to performance:
sudo cpupower frequency-set -g performance
  2. Enable GPU persistence mode:
sudo nvidia-smi -pm 1
  3. Lock GPU clock frequencies:
sudo nvidia-smi -lgc <min_clock>,<max_clock>
sudo nvidia-smi -lmc <memory_clock>
  4. Set PCIe ASPM (Active State Power Management) to performance mode:
echo performance | sudo tee /sys/module/pcie_aspm/parameters/policy

Despite these efforts, we could not mitigate the data-rate fluctuations shown above. Additionally, we monitored the GPU with:

nvidia-smi dmon -o TD -s pucvmet

No PCIe-level errors were detected during monitoring.

Additional information:

PCIe Generation

• Max: 4

• Current: 4

• Device Current: 4

• Device Max: 4

• Host Max: 4

Link Width

• Max: 16x

• Current: 16x

Driver Version: 550.54.14

CUDA Version: 12.4

Here are the results of a quick test I ran on an L4 GPU, on Ubuntu 22.04 with the 5.15.0-78-generic kernel, as root, with the GPU not configured to drive a display, on a PCIe Gen3 server:

# while sleep 1; do cuda-samples/bin/x86_64/linux/release/bandwidthTest --memory=pinned --htod |grep 32; done
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.4
   32000000                     12.3
   32000000                     12.3
   32000000                     12.4
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.4
   32000000                     12.3
   32000000                     12.3
   32000000                     12.4
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.4
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
   32000000                     12.3
^C
#

(CUDA 12.2, driver 535.86.10)

That represents a test duration of about 3-4 minutes. I didn’t make any other settings, other than what you see there. I can’t explain what may be happening in your case; however, if the 4080 is driving a display, you might want to reconfigure that. You might want to try running as root. You could also look at something like top or ps -ef to see what else is running, and see if you can stop anything unnecessary from running. It might also be that the variation is expected and only shows up in a PCIe Gen4 setting. You might also want to investigate system topology. If the system has PCIe switches between the CPU socket and the GPU, that may be a factor, especially if there are other devices hanging off those switches. In my test server there is only one GPU in the system, and no PCIe switches between the CPU socket and the GPU. I’m also running without any containers or anything like that.

The magnitude of those fluctuations (maximum of 7%) seems within the range of normal variance for memory operations to me, and the absolute performance seen is in the “meet or exceed” range for a PCIe4 x16 interconnect. Anything above 25 GB/sec should be considered icing on the cake. Generally speaking, at the upper boundaries of achievable performance, spare capacity of various buffer structures in processors can become marginal, but there is no point in overdesigning them, as it adds cost.

I don’t know what methodology is used by cuda_sample_test. Because of expected variance, a good heuristic established by the STREAM benchmark decades ago is to run with each test configuration ten times, then report the performance of the best run.
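
For illustration, here is a minimal sketch of that heuristic as a standalone CUDA program (this is not the bandwidthTest source; the 32,000,000-byte transfer size and the repeat count of ten are arbitrary choices):

// best_of_n.cu - sketch of the "run N times, report the best" heuristic.
// Build: nvcc -O2 best_of_n.cu -o best_of_n
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 32000000;   // same transfer size as in the output above
    const int    runs  = 10;         // STREAM-style: repeat, keep the best

    void *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, bytes);       // pinned host buffer
    cudaMalloc(&d, bytes);

    double best_gbps = 0.0;
    for (int i = 0; i < runs; ++i) {
        cudaDeviceSynchronize();     // quiesce the device before timing
        auto t0 = std::chrono::steady_clock::now();
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // blocking H2D copy
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        best_gbps = std::max(best_gbps, bytes / s / 1e9);
    }
    printf("best of %d runs: %.1f GB/s\n", runs, best_gbps);

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}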

Any transfers between host and device can be impacted not just by the device performance and the interconnect performance, but also the performance characteristics of the host’s system memory. I did not spot any host system specifications, so it is impossible to point out specific potential caveats (such as a system memory that is sub-optimally configured in some way). At minimum:

(1) Ensure that the host system is completely idle other than for the benchmark running. System memory is a shared resource, so any memory activity by other agents can impact the bandwidth available for PCIe transfers.

(2) Use numactl to fix processor and memory affinity so the benchmark uses the same CPU memory controller(s) in every run. Even in host systems with a single CPU socket, many modern CPUs are constructed from core clusters that are tied together with a high-performance internal interconnect, causing the CPU to have mild NUMA characteristics.
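
For example, the benchmark can simply be launched under numactl (e.g. numactl --cpunodebind=0 --membind=0 ./bandwidthTest ...). As a rough programmatic equivalent, and purely as a sketch that assumes libnuma is installed and picks node 0 arbitrarily, the binding can also be done inside the test program before the pinned buffer is allocated:

// numa_pin.cu - sketch: bind execution and memory policy to one NUMA node
// before allocating the pinned staging buffer (a rough programmatic
// equivalent of launching the benchmark under numactl). Node 0 is an
// arbitrary choice. Build: nvcc -O2 numa_pin.cu -o numa_pin -lnuma
#include <cstdio>
#include <numa.h>
#include <cuda_runtime.h>

int main() {
    if (numa_available() != -1) {
        numa_run_on_node(0);    // restrict execution to the CPUs of node 0
        numa_set_preferred(0);  // prefer node 0 for subsequent allocations
    } else {
        printf("libnuma reports no NUMA support; running unbound\n");
    }

    // Allocate the pinned host buffer only after setting the policy, so its
    // physical pages should come from the selected node's memory controllers.
    const size_t bytes = 32000000;
    void *h = nullptr;
    if (cudaMallocHost(&h, bytes) != cudaSuccess) {
        printf("cudaMallocHost failed\n");
        return 1;
    }
    // ... run the H2D/D2H measurement loop here ...
    cudaFreeHost(h);
    return 0;
}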

This is the bandwidthTest I used.

Thank you for your reply. In my tests, the duration typically ranges from 4 to 24 hours. After multiple rounds of testing, I found that only H2D exhibits this occasional bandwidth drop. In contrast, continuous D2H tests do not show any bandwidth drops. I have also updated to the latest official driver, but the issue still persists. My environment is dedicated to my use only, with plenty of available CPU and memory resources and no other applications running.

I also have a question: if I want to continuously test this bandwidth, is there a better way than running bandwidthTest repeatedly and recording each result? I believe this approach introduces some inaccuracies and is somewhat inefficient. Instead, I would like to continuously transfer H2D or D2H data over the PCIe lanes to observe performance, similar to how a network tool like iperf3 works.

Thank you for your reply. My CPU does not have additional NUMA nodes, so there should be no memory consistency issues.

I would also like to ask if you have any suggestions for a better way to perform continuous H2D or D2H bandwidth testing, rather than running the bandwidthTest repeatedly.

I don’t know that there is anything wrong with bandwidthTest. You could always program your own test for measuring PCIe bandwidth if you mistrust this sample app. Any results you achieve with your own test app would not necessarily differ materially from the results of bandwidthTest.

As I stated previously, the data posted does not imply “memory consistency issues”, and the observation as stated should not be reason for concern; nothing appears to be “wrong” with the system.

Bandwidth tests of any kind tend to have higher variance than compute-bound benchmarks. Part of the problem is the large amount of state involved in running such tests, and the impossibility of replicating that state exactly from run to run. Also, small differences in initial state can lead to more sizeable differences in final results due to the butterfly effect.

That being said, I believe it would make sense to take a quick look at the host system specification to check whether part of the benchmark variance observed could be due to marginal performance headroom on the host system side.

(1) What is the CPU being used?
(2) How is the system memory configured? How many memory channels are being used (populated)? What speed grade of DDR4 / DDR5 is used?

Thanks

The CPU is an AMD EPYC 7643 48-Core Processor.

The system has a total of 128 GiB of DDR4 registered (buffered) memory with ECC (Error-Correcting Code) support. The motherboard has a capacity of up to 4 TiB of memory.

There are 8 DIMM slots populated, each with the following specifications:

Size: 16 GiB per module

Type: DDR4 Synchronous Registered (Buffered)

Speed: 3200 MHz

Width: 64 bits

Error Detection: Multi-bit ECC enabled

Regarding the bandwidth test, I trust this tool, but I would like to find a more suitable, officially provided tool that can continuously test H2D and D2H transfers.
Since this is hardware we developed ourselves, we need to verify its reliability through software-level metrics. At present, we cannot confirm whether the occasional drop in speed we are seeing is caused by the testing program or by a problem in the hardware design itself.

That is most certainly a CPU constructed from multiple chiplets (comprising a total of eight core complexes of six cores each, according to the TechPowerUp database) that communicate with each other using an internal high-speed interconnect. It therefore exhibits mild NUMA characteristics. I would re-iterate my earlier suggestion of fixing processor and memory affinity with numactl to see whether this leads to a reduction in the variability observed in the PCIe throughput test.

Assuming all eight DRAM channels are populated with DDR4-3200 memory, the practically achievable system memory throughput should be on the order of 160 to 165 GB/sec, more than enough to source data for host->device transfers without hiccups. Many system BIOSes offer the option of selecting “1T command rate” to minimize DRAM latency, which requires that no more than 1 DIMM is installed per channel.

Why could the acceptance criterion not be “PCIe throughput in each direction >= 25 GB/sec”? I guess I still do not understand why there is concern about some (not unexpected) variability ranging from 25 GB/sec to 26.7 GB/sec, given that the expected PCIe4 throughput level is 25+ GB/sec.

Thank you again for your response and understanding. Our acceptance criterion is that (max - min) / min should be less than 5%, meaning that if the maximum is 26.7 GB/s, the minimum should not fall below 25.4 GB/s.

After multiple rounds of testing, we did observe fluctuations as low as 25.0 GB/s or even lower, though the probability of such occurrences might be as low as 1 in 3000 or less. We cannot confirm whether this fluctuation is caused by the hardware (although, at present, it seems unlikely).

Therefore, our proposed solution is to either eliminate this fluctuation, come up with a more reasonable testing method, or define a more acceptable fluctuation range (5%? 7%? 10%? I am not sure what the appropriate threshold should be).

Any specific motivations for the 5% limit? Why even have a variability limit, when a lower performance bound is all that should be required in the vast majority of use cases?

Anyway, I would suggest doing experiments with numactl to see whether you can get the variability limited to a narrower range (my expectation is that use of numactl will achieve that), then revisit the question of what the bounds should be. From what I have seen over thirty years of bandwidth measurements, I would set the variability bound (if one is definitely required) wider, say 10%. In my experience of reviewing lists of daily test failures across a 400-machine cluster, a lot of frustration can accumulate quickly when spending time investigating what ultimately turn out to be false positives, i.e. non-issues.

You could modify and recompile the bandwidth test utility to run continuously.

Would limiting the maximum bandwidth be another option for your requirements? Perhaps there are hardware settings or software solutions to achieve that?
E.g. with added delays you could vastly reduce the variance.

Thank you for your reply. We compiled our tool from the source code downloaded from GitHub. It seems that bandwidthTest does not support setting a maximum bandwidth limit for testing.

Instead of adding delays, we are more interested in continuously and stably testing the bandwidth.

Those were meant as two different answers:

You can modify the bandwidth tool to run the test several (or indefinitely many) times and, e.g., reuse memory (but be careful that it does not just read from the cache).
During that time you can use nvidia-smi and other tools to watch energy states and clock frequencies.
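
As a sketch of the first point (this is not the bandwidthTest source, just a minimal standalone loop; the 512 MiB buffer, the batch size, and the reuse of one pinned buffer are arbitrary choices, with the buffer sized well above typical last-level caches to address the caching concern):

// h2d_stream.cu - sketch of a continuous H2D bandwidth monitor, loosely in
// the spirit of iperf3. Not the bandwidthTest source; stop it with Ctrl-C.
// Build: nvcc -O2 h2d_stream.cu -o h2d_stream
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes     = 512u << 20; // 512 MiB pinned buffer, reused every batch
    const int    copiesPer = 16;         // copies per timed batch

    void *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, bytes);           // pinned, so the copy engine performs a true DMA transfer
    cudaMalloc(&d, bytes);

    cudaStream_t stream;
    cudaEvent_t  start, stop;
    cudaStreamCreate(&stream);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (unsigned long batch = 0; ; ++batch) {
        cudaEventRecord(start, stream);
        for (int i = 0; i < copiesPer; ++i)
            cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
        cudaEventRecord(stop, stream);
        cudaEventSynchronize(stop);      // wait for the whole batch to drain

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gbps = (double)bytes * copiesPer / (ms / 1e3) / 1e9;
        printf("batch %lu: %.1f GB/s\n", batch, gbps);
        fflush(stdout);                  // so the output can be tee'd to a log
    }
    // not reached; resources are released on process exit
}

While this runs, nvidia-smi dmon can be left open in a second terminal to correlate any dips with clock or power-state changes, and the per-batch output can be logged and post-processed for minimum or percentile statistics.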

If your requirement is to keep the variance low, then besides changing the allowed variance as njuffa suggested, you can try to limit the bandwidth somehow. The question is whether to disallow high-bandwidth outliers.

Perhaps the requirement should be something like: out of 1,000,000 transactions, each one reached at least 20 GB/s, and 99.9% were above 24.5 GB/s (i.e. at most 1,000 of the 1,000,000 transactions may fall between 20 and 24.5 GB/s).

Thank you for your suggestion. I’ll continue to try it on my end.