GDS performance test results are not as expected

I enabled GDS to test the performance of BeeGFS, and the system reports that GDS is enabled normally. However, the GDS test performance is close to the performance without GDS; there is no significant difference between the two, and sometimes GDS performs worse than without it. /mnt/orcafs/ is mounted on the remote BeeGFS file system.
The following is the test data:

with GDS enabled:
[root@myhost ~]# modprobe nvidia-fs
[root@myhost ~]# /usr/local/cuda-12.1/gds/tools/gdsio -f  /mnt/orcafs/data/fio-seq-writes-888 -d 0 -w 4 -s 10G -i 1M -I 1 -x 0
IoType: WRITE XferType: GPUD Threads: 4 DataSetSize: 10430464/10485760(KiB) IOSize: 1024(KiB) Throughput: 3.689626 GiB/sec, Avg_Latency: 1056.230311 usecs ops: 10186 total_time 2.696009 secs
[root@myhost ~]# /usr/local/cuda-12.1/gds/tools/gdsio -f  /mnt/orcafs/data/fio-seq-writes-888 -d 0 -w 4 -s 10G -i 1M -I 0 -x 0
IoType: READ XferType: GPUD Threads: 4 DataSetSize: 10354688/10485760(KiB) IOSize: 1024(KiB) Throughput: 4.415478 GiB/sec, Avg_Latency: 883.326053 usecs ops: 10112 total_time 2.236451 secs

without GDS enabled:
[root@myhost ~]# modprobe -r nvidia_fs
[root@myhost ~]# /usr/local/cuda-12.1/gds/tools/gdsio -f  /mnt/orcafs/data/fio-seq-writes-701 -d 0 -w 4 -s 10G -i 1M -I 1 -x 0
IoType: WRITE XferType: GPUD Threads: 4 DataSetSize: 10345472/10485760(KiB) IOSize: 1024(KiB) Throughput: 3.825947 GiB/sec, Avg_Latency: 1020.608027 usecs ops: 10103 total_time 2.578763 secs
[root@myhost ~]# /usr/local/cuda-12.1/gds/tools/gdsio -f  /mnt/orcafs/data/fio-seq-writes-701 -d 0 -w 4 -s 10G -i 1M -I 0 -x 0
IoType: READ XferType: GPUD Threads: 4 DataSetSize: 10389504/10485760(KiB) IOSize: 1024(KiB) Throughput: 4.463720 GiB/sec, Avg_Latency: 873.643496 usecs ops: 10146 total_time 2.219719 secs
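For what it's worth, the two configurations can also be compared without unloading the driver, by switching gdsio's transfer-type flag. A minimal sketch, assuming -x 2 selects the CPU bounce-buffer path (Storage->CPU->GPU) and reusing the paths from above:

```shell
# Sketch only: compare the GPUDirect path (-x 0) against the assumed CPU
# bounce-buffer path (-x 2, Storage->CPU->GPU) on the same file, without
# unloading nvidia-fs between runs. GDSIO/TARGET paths are from the post.
GDSIO=/usr/local/cuda-12.1/gds/tools/gdsio
TARGET=/mnt/orcafs/data/fio-seq-writes-888

if command -v "$GDSIO" >/dev/null; then
  for xfer in 0 2; do
    out=$("$GDSIO" -f "$TARGET" -d 0 -w 4 -s 10G -i 1M -I 0 -x "$xfer")
    # gdsio prints e.g. "Throughput: 4.415478 GiB/sec"; extract that field
    tput=$(printf '%s\n' "$out" | grep -o 'Throughput: [0-9.]* GiB/sec')
    echo "xfer_type=$xfer $tput"
  done
else
  echo "gdsio not found at $GDSIO (adjust for your CUDA install)"
fi
```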

My best guess is that the PCIe affinity between my GPU and network card is not close enough (the topology below shows PHB). But I'm not sure this is the root cause; can anyone help? Thanks!

The following is my environment and configuration information.

[root@myhost ~]# nvidia-smi
Tue May 23 18:22:23 2023
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla P4                        Off| 00000000:83:00.0 Off |                    0 |
| N/A   50C    P0               23W /  75W|      0MiB /  7680MiB |      1%      Default |
|                                         |                      |                  N/A |

| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|  No running processes found                                                           |
[root@myhost ~]# nvidia-smi topo -mp
        GPU0    NIC0    CPU Affinity    NUMA Affinity
GPU0     X      PHB     14-27,42-55     1
NIC0    PHB      X


  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge

NIC Legend:

  NIC0: mlx5_0

[root@myhost ~]# /usr/local/cuda-12.1/gds/tools/gdscheck -p
 GDS release version:
 nvidia_fs version:  2.15 libcufile version: 2.12
 Platform: x86_64
 NVMe               : Supported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Supported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 1
 properties.rdma_dynamic_routing_order : GPU_MEM_NVLINKS GPU_MEM SYS_MEM P2P
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.beegfs.rdma_dev_addr_list :
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 0
 execution.max_io_queue_depth : 128
 execution.parallel_io : false
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 0
 GPU index 0 Tesla P4 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 IOMMU: disabled
 Platform verification succeeded

The config seems okay. Please check cufile.log to see whether, during the GDS test, IO is falling back to compat mode.
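A quick way to check, as a sketch (cufile.log is written in the test's working directory by default; the location and verbosity can be changed under "logging" in /etc/cufile.json):

```shell
# Sketch: scan the cuFile log for compat-mode fallback notices.
# cufile.log lands in the working directory by default; adjust the path
# if your /etc/cufile.json points the logging directory elsewhere.
if grep -i 'compatible mode' cufile.log; then
  echo "IO fell back to compat mode"
else
  echo "no compat-mode notices found"
fi
```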

Compat mode uses the best practices we have for GPU I/O to get optimal performance with GPUs, so it is quite possible for it to match GPUD throughput when the CPU is not a bottleneck.

Please try timed mode (-T 30) to average out the results better.

Also, is 4.4 GiB/sec the speed-of-light (SOL) number for the network? It could very well be that the network is the bottleneck.
Increasing the thread count (-w 6) might help with throughput if network latency is high.

Also, in this case I would try scaling down to smaller IO sizes to see the difference in IOPS between GPUD and compat mode.
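Such a sweep could look like the following sketch (assuming -x 2 is the CPU-mediated transfer type, and reusing the paths from the post):

```shell
# Sketch of the suggested sweep: run the read test at decreasing IO sizes
# in both the GPUD path (-x 0) and the assumed CPU path (-x 2), in timed
# mode (30 s), to compare IOPS. GDSIO/TARGET paths are from the post.
GDSIO=/usr/local/cuda-12.1/gds/tools/gdsio
TARGET=/mnt/orcafs/data/fio-seq-writes-888

if command -v "$GDSIO" >/dev/null; then
  for iosize in 1M 256K 64K 16K 4K; do
    for xfer in 0 2; do
      echo "== IOSize=$iosize xfer_type=$xfer =="
      "$GDSIO" -f "$TARGET" -d 0 -w 4 -s 10G -i "$iosize" -I 0 -x "$xfer" -T 30
    done
  done
else
  echo "gdsio not found at $GDSIO (adjust for your CUDA install)"
fi
```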

Following your suggestion, I ran the test again. The results of the 30-second run are basically the same as those without the -T 30 option.

Based on the following information, I believe the unsatisfactory acceleration with GDS enabled may be related to the PCIe device topology, or that, although GDS is enabled, the CPU is not actually being bypassed.

One noteworthy observation is that regardless of whether GDS is on or off, CPU consumption is around 250%. Based on this, I suspect the CPU and system memory are not truly being bypassed. Can anyone offer some opinions or insights?

When GDS is enabled, the cufile.log says:

24-05-2023 15:20:44:646 [pid=2088378 tid=2088378] ERROR  cufio-dr:226 No matching pair for network device to closest GPU found in the platform
24-05-2023 15:22:59:446 [pid=2096916 tid=2096916] ERROR  cufio-dr:226 No matching pair for network device to closest GPU found in the platform

When GDS is disabled, the cufile.log says:

24-05-2023 15:46:00:338 [pid=2195236 tid=2195236] NOTICE  cufio-drv:720 running in compatible mode
24-05-2023 15:46:44:215 [pid=2197705 tid=2197705] NOTICE  cufio-drv:720 running in compatible mode

The method I used to disable GDS was to remove the nvidia-fs module with 'modprobe -r nvidia_fs', which I guess should be equivalent to setting the -x option to 2 (transfer type 2: Storage->CPU->GPU).