We compiled the bandwidthTest tool from the latest official cuda-samples to test our hardware. The PCIe hardware itself is functioning normally. After multiple rounds of long-duration testing, we found no fluctuations in the d2h (device-to-host) direction; however, the h2d (host-to-device) tests show fluctuations like those in the graph below.
Some of the test data looks like this:
2025-02-21 04:08:20 , 26.6
2025-02-21 04:08:20 , 26.6
2025-02-21 04:08:21 , 26.7
2025-02-21 04:08:21 , 26.6
2025-02-21 04:08:22 , 25.0  // fluctuation
2025-02-21 04:08:22 , 26.7
2025-02-21 04:08:22 , 26.7
2025-02-21 04:08:23 , 26.7
2025-02-21 04:08:23 , 26.7
2025-02-21 04:08:24 , 26.7
2025-02-21 04:08:24 , 26.6
2025-02-21 04:08:24 , 26.7
2025-02-21 04:08:25 , 26.7
2025-02-21 06:01:08 , 26.6
2025-02-21 06:01:08 , 26.6
2025-02-21 06:01:08 , 26.6
2025-02-21 06:01:09 , 26.7
2025-02-21 06:01:09 , 26.6
2025-02-21 06:01:10 , 25.7  // fluctuation
2025-02-21 06:01:10 , 26.7
2025-02-21 06:01:11 , 26.6
2025-02-21 06:01:11 , 26.7
2025-02-21 06:01:11 , 26.6
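To put numbers on the dips, the logged values can be summarized with a small awk script. This is just a sketch; the three inlined lines stand in for a real log file such as gpu_0_pinned_h2d_01.log:

```shell
# Summarize the bandwidth column (2nd CSV field) of "timestamp , value" log lines.
stats=$(awk -F ',' '{
  v = $2 + 0                        # second field is the bandwidth in GB/s
  if (n == 0 || v < min) min = v
  if (n == 0 || v > max) max = v
  sum += v; n++
} END { printf "n=%d min=%.1f max=%.1f mean=%.2f", n, min, max, sum / n }' <<'EOF'
2025-02-21 04:08:20 , 26.6
2025-02-21 04:08:22 , 25.0
2025-02-21 04:08:22 , 26.7
EOF
)
echo "$stats"
```

Run against the full log (replace the here-document with the log path), this gives a quick sense of how deep and how frequent the dips are.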
Testing Command:
Our test command, simplified, is as follows:
gpu_i=0 && while true; do
./bandwidthTest --device=${gpu_i} --htod --csv | grep H2D | awk -F ',' '{print $2}' | awk '{print $3}' | \
awk '{now=strftime("%F %T , ");sub(/^/, now);print}' | \
tee -a ../log/gpu_0_pinned_h2d_01.log
done
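As an aside, the grep stage and the three awk stages above can be collapsed into a single awk invocation. The sample line below is an assumed stand-in for bandwidthTest's --csv output, not a verified transcript, so the field indices may need adjusting against real output:

```shell
# Hypothetical bandwidthTest --csv output line (format assumed, not verified).
sample='bandwidthTest-H2D-Pinned, Bandwidth = 26.6 GB/s, Time = 0.00125 s'

# One awk stage: filter H2D lines, pull the numeric value out of field 2,
# and prepend a timestamp (date used instead of gawk-only strftime for portability).
line=$(printf '%s\n' "$sample" | awk -F ',' -v ts="$(date '+%F %T')" \
  '/H2D/ { split($2, f, " "); printf "%s , %s", ts, f[3] }')
echo "$line"
```

This produces the same "timestamp , value" lines as the original pipeline with one fewer process per stage.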
Environment Information:
• Kernel version:
Linux pilot 5.15.0-97-generic #107-Ubuntu SMP Wed Feb 7 13:26:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
• GPU model:
NVIDIA RTX 4080 Super
• Both CPU and GPU temperatures are normal.
• Cooling systems are functioning properly.
Attempted Solutions (Without Success):
- Set CPU power mode to performance:
sudo cpupower frequency-set -g performance
- Enable GPU persistence mode:
sudo nvidia-smi -pm 1
- Lock GPU clock frequencies:
sudo nvidia-smi -lgc <min_clock>,<max_clock>
sudo nvidia-smi -lmc <memory_clock>
- Set PCIe ASPM (Active State Power Management) to performance mode:
echo performance | sudo tee /sys/module/pcie_aspm/parameters/policy
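One thing worth re-checking while the test loop is running is whether the settings above actually stuck (e.g. whether the GPU stays in P0 at the locked clocks during the dips). A best-effort sketch; each probe falls back to "unavailable" so it also runs on machines without these sysfs files or without a GPU:

```shell
# Report the currently active power-management state; every probe is
# best-effort so the script degrades gracefully instead of failing.
aspm=$(cat /sys/module/pcie_aspm/parameters/policy 2>/dev/null || echo unavailable)
gov=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 2>/dev/null || echo unavailable)
gpu=$(nvidia-smi --query-gpu=pstate,clocks.sm,clocks.mem --format=csv 2>/dev/null \
      || echo unavailable)
echo "ASPM policy:  $aspm"
echo "CPU governor: $gov"
echo "GPU state:    $gpu"
```

If a dip coincides with a pstate change or a clock drop, the fluctuation is a clocking issue rather than a PCIe one.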
Despite these efforts, we could not mitigate the data-rate fluctuations shown in the graph. We also monitored the GPU with:
nvidia-smi dmon -o TD -s pucvmet
and no PCIe-level errors were detected during monitoring.
Additional Information:
• PCIe Generation
• Max: 4
• Current: 4
• Device Current: 4
• Device Max: 4
• Host Max: 4
• Link Width
• Max: 16x
• Current: 16x
• Driver Version: 550.54.14
• CUDA Version: 12.4