GDS performance test results are not as expected
I enabled GDS to test the performance of BeeGFS, and the tools report that GDS is enabled and working normally. However, throughput with GDS is close to throughput without it: there is no significant difference between the two, and sometimes GDS is actually slower than the non-GDS path. /mnt/orcafs/ is a mount of a remote BeeGFS file system.
The following is the test data:
With GDS enabled:
[root@myhost ~]# modprobe nvidia-fs
[root@myhost ~]# /usr/local/cuda-12.1/gds/tools/gdsio -f /mnt/orcafs/data/fio-seq-writes-888 -d 0 -w 4 -s 10G -i 1M -I 1 -x 0
IoType: WRITE XferType: GPUD Threads: 4 DataSetSize: 10430464/10485760(KiB) IOSize: 1024(KiB) Throughput: 3.689626 GiB/sec, Avg_Latency: 1056.230311 usecs ops: 10186 total_time 2.696009 secs
[root@myhost ~]# /usr/local/cuda-12.1/gds/tools/gdsio -f /mnt/orcafs/data/fio-seq-writes-888 -d 0 -w 4 -s 10G -i 1M -I 0 -x 0
IoType: READ XferType: GPUD Threads: 4 DataSetSize: 10354688/10485760(KiB) IOSize: 1024(KiB) Throughput: 4.415478 GiB/sec, Avg_Latency: 883.326053 usecs ops: 10112 total_time 2.236451 secs
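(As far as I know, the nvidia-fs driver exposes counters that show whether transfers actually went through the GDS path rather than a compat-mode fallback; I have not captured that output here, but for reference it can be read before and after a run. The exact layout of this file varies across nvidia-fs versions.)
[root@myhost ~]# cat /proc/driver/nvidia-fs/stats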
With GDS disabled (nvidia_fs module unloaded):
[root@myhost ~]# modprobe -r nvidia_fs
[root@myhost ~]# /usr/local/cuda-12.1/gds/tools/gdsio -f /mnt/orcafs/data/fio-seq-writes-701 -d 0 -w 4 -s 10G -i 1M -I 1 -x 0
IoType: WRITE XferType: GPUD Threads: 4 DataSetSize: 10345472/10485760(KiB) IOSize: 1024(KiB) Throughput: 3.825947 GiB/sec, Avg_Latency: 1020.608027 usecs ops: 10103 total_time 2.578763 secs
[root@myhost ~]# /usr/local/cuda-12.1/gds/tools/gdsio -f /mnt/orcafs/data/fio-seq-writes-701 -d 0 -w 4 -s 10G -i 1M -I 0 -x 0
IoType: READ XferType: GPUD Threads: 4 DataSetSize: 10389504/10485760(KiB) IOSize: 1024(KiB) Throughput: 4.463720 GiB/sec, Avg_Latency: 873.643496 usecs ops: 10146 total_time 2.219719 secs
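As an alternative A/B comparison that does not require unloading the module, gdsio's -x flag selects the transfer path directly. If I read the tool's help output correctly, -x 0 is GPU_DIRECT and -x 2 is a storage -> CPU bounce buffer -> GPU copy, so a run like the following should give the non-GDS baseline for the same file:
[root@myhost ~]# /usr/local/cuda-12.1/gds/tools/gdsio -f /mnt/orcafs/data/fio-seq-writes-888 -d 0 -w 4 -s 10G -i 1M -I 0 -x 2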
Based on my understanding, the most likely reason is that the PCIe affinity between my GPU and NIC is poor: the topology below shows PHB, so peer-to-peer traffic between them has to cross the PCIe host bridge. But I'm not sure this is the root cause. Can anyone help? Thanks!
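One thing I still plan to try: the gdscheck output below shows properties.use_compat_mode : true, and as far as I understand, with compat mode on, libcufile silently falls back to POSIX I/O whenever the true GDS path is unavailable, which would make the two tests above behave identically. Disabling compat mode in /etc/cufile.json should make such a fallback fail loudly instead. A sketch of the relevant excerpt (the rest of the file stays unchanged; raising the log level is only to get a detailed cufile.log):
{
    "logging": {
        // ERROR|WARN|INFO|DEBUG|TRACE
        "level": "TRACE"
    },
    "properties": {
        // fail instead of silently falling back to POSIX I/O
        "use_compat_mode": false
    }
}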
The following is my environment and configuration information:
[root@myhost ~]# nvidia-smi
Tue May 23 18:22:23 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla P4 Off| 00000000:83:00.0 Off | 0 |
| N/A 50C P0 23W / 75W| 0MiB / 7680MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
[root@myhost ~]# nvidia-smi topo -mp
GPU0 NIC0 CPU Affinity NUMA Affinity
GPU0 X PHB 14-27,42-55 1
NIC0 PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NIC Legend:
NIC0: mlx5_0
[root@myhost ~]# /usr/local/cuda-12.1/gds/tools/gdscheck -p
GDS release version: 1.6.1.9
nvidia_fs version: 2.15 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Supported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Supported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 1
properties.rdma_dynamic_routing_order : GPU_MEM_NVLINKS GPU_MEM SYS_MEM P2P
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.beegfs.rdma_dev_addr_list : 192.168.20.142
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 0
execution.max_io_queue_depth : 128
execution.parallel_io : false
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 0
=========
GPU INFO:
=========
GPU index 0 Tesla P4 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Platform verification succeeded
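Finally, I notice the gdscheck output above also shows Mellanox PeerDirect : Disabled and Userspace RDMA : Unsupported. If I understand the GDS documentation correctly, GPUDirect RDMA to a remote file system needs PeerDirect support, which on recent drivers is provided by the nvidia-peermem kernel module on top of MLNX_OFED; I am not certain how much this matters for the kernel-level BeeGFS path, but it seems worth checking whether the module is loaded:
[root@myhost ~]# lsmod | grep nvidia_peermem
[root@myhost ~]# modprobe nvidia-peermem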