Problem:
I cannot get GDS configured with Beegfs support. Specifically, the “Beegfs: supported/unsupported” field does not show up in the output of the gdscheck.py script, even after following the instructions in the NVIDIA GPUDirect Storage Installation and Troubleshooting Guide (NVIDIA Docs), which outlines the procedure for installing GDS and getting it to work with Beegfs.
Context:
The system I am trying to build is part of an academic research project that requires the optimizations offered by GDS in conjunction with Beegfs. I am configuring GDS to work with the Beegfs client service on an AMD HPC system with the following specifications, drivers, and Beegfs packages:
RHEL 8.5, Kernel: Linux 4.18.0-348.7.1.el8_5.x86_64
NIC/HCA: Mellanox ConnectX5
OFED Driver: MLNX_OFED_LINUX-5.5-1.0.3.2 (OFED-5.5-1.0.3)
subnetmanager: opensm-5.10.0
GPU: Nvidia A10
Nvidia-fs: nvidia-fs-2.16.1 --installed via .runfile
Nvidia Driver: nvidia-535.54.03 --installed via .runfile
Cuda/toolkit: cuda-12.2 --installed via .runfile
Beegfs: beegfs-client-20:7.3.3-el8.noarch --installed via yum
I followed the beegfs-quickstart-guide and am running the management, storage, and metadata services with RDMA over InfiniBand. I verified that this configuration works before doing anything GDS-related by running the beegfs-net command, which yields the following:
mgmt_nodes
=============
HPC25-ALPACA [ID: 1]
Connections: TCP: 1 192.168.11.25:8008;
meta_nodes
=============
HPC25-ALPACA [ID: 25]
Connections: RDMA: 1 192.168.11.25:8005;
storage_nodes
=============
HPC25-ALPACA [ID: 2525]
Connections: RDMA: 1 192.168.11.25:8003;
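For anyone who wants to repeat this check without reading the output by eye, here is a minimal sketch that parses text shaped like the beegfs-net output above and lists any metadata or storage connection that is not RDMA. The function names are my own (hypothetical), and the parser only assumes the layout shown in this post; it also tolerates optional parentheses around the address.

```python
import re

def parse_beegfs_net(text):
    """Return {section: [(node, protocol, count, address), ...]}."""
    result = {}
    section = None
    node = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) == {"="}:
            continue  # skip blank lines and "=====" rulers
        if line.endswith("_nodes"):
            section = line
            result.setdefault(section, [])
        elif line.startswith("Connections:"):
            # Matches e.g. "RDMA: 1 192.168.11.25:8005"
            pattern = r"(\w+): (\d+) \(?([\d.]+:\d+)\)?"
            for proto, count, addr in re.findall(pattern, line):
                result.setdefault(section, []).append(
                    (node, proto, int(count), addr))
        elif "[ID:" in line:
            node = line.split("[ID:")[0].strip()
    return result

def non_rdma_nodes(parsed, sections=("meta_nodes", "storage_nodes")):
    """List (section, node, protocol) for connections that are not RDMA."""
    return [(s, node, proto)
            for s in sections
            for node, proto, _count, _addr in parsed.get(s, [])
            if proto != "RDMA"]

# Excerpt copied from the output above.
SAMPLE = """meta_nodes
=============
HPC25-ALPACA [ID: 25]
Connections: RDMA: 1 192.168.11.25:8005;
"""
print(non_rdma_nodes(parse_beegfs_net(SAMPLE)))
```

An empty list means every metadata and storage connection in the parsed text uses RDMA; the management service is left out of the default sections because it connects over TCP in the output above.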
After completing the steps in the Beegfs GDS support document, I do not get a “Beegfs: supported” or “Beegfs: unsupported” field, which is supposed to show up when I run the gdscheck.py script. This Python script ends up calling the gdscheck executable, for which I cannot find source files anywhere. The buildArgs in my beegfs-autobuild.conf file are:
buildArgs=-j8 OFED_INCLUDE_PATH=/usr/src/ofa_kernel/default/include NVFS_INCLUDE_PATH=/usr/src/nvidia-fs-2.16.1 NVIDIA_INCLUDE_PATH=/usr/src/nvidia-535.54.03/nvidia
I have verified that nvfs_dma.h and config-host.h are in the NVFS_INCLUDE_PATH and that nv-p2p.h is in the NVIDIA_INCLUDE_PATH.
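The header check can be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, the header file names are the ones written in this post, and it only confirms that the files exist under the configured paths, not that the client module was actually built against them.

```python
import shlex
from pathlib import Path

# Header names as written in this post, keyed by the buildArgs variable
# whose directory is expected to contain them.
REQUIRED = {
    "NVFS_INCLUDE_PATH": ["nvfs_dma.h", "config-host.h"],
    "NVIDIA_INCLUDE_PATH": ["nv-p2p.h"],
}

def parse_build_args(line):
    """Extract KEY=VALUE pairs from a 'buildArgs=...' line."""
    _, _, value = line.partition("=")
    args = {}
    for token in shlex.split(value):
        key, sep, val = token.partition("=")
        if sep:  # tokens such as "-j8" have no "=" and are skipped
            args[key] = val
    return args

def missing_headers(args, required=REQUIRED):
    """Return (variable, header) pairs that cannot be found on disk."""
    missing = []
    for var, headers in required.items():
        base = args.get(var)
        for header in headers:
            if base is None or not (Path(base) / header).is_file():
                missing.append((var, header))
    return missing

BUILD_ARGS = (
    "buildArgs=-j8 OFED_INCLUDE_PATH=/usr/src/ofa_kernel/default/include "
    "NVFS_INCLUDE_PATH=/usr/src/nvidia-fs-2.16.1 "
    "NVIDIA_INCLUDE_PATH=/usr/src/nvidia-535.54.03/nvidia"
)
print(missing_headers(parse_build_args(BUILD_ARGS)))
```

An empty list means every listed header was found under its include path on the machine where the script runs.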
The output of gdscheck.py -p on my machine is below.
GDS release version: 1.7.0.149
nvidia_fs version: 2.16 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Supported
NVMeOF : Supported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Enabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 1024
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 18014398509481980
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 0
execution.max_io_queue_depth : 128
execution.parallel_io : false
execution.min_io_threshold_size_kb : 1024
execution.max_request_parallelism : 0
properties.force_odirect_mode : false
properties.prefer_iouring : false
=========
GPU INFO:
=========
GPU index 0 NVIDIA A10 bar:1 bar size (MiB):32768 supports GDS, IOMMU State: Disabled
GPU index 1 NVIDIA A10 bar:1 bar size (MiB):32768 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Platform verification succeeded
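To make the missing entry easy to spot after each rebuild, here is a minimal sketch that pulls the DRIVER CONFIGURATION block out of saved gdscheck.py -p output and reports whether a given filesystem is listed. The function names are hypothetical, and because I do not know the exact label gdscheck would print for Beegfs, the lookup is a case-insensitive substring match.

```python
def driver_config(text):
    """Map each DRIVER CONFIGURATION entry to its status string."""
    entries = {}
    in_block = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "DRIVER CONFIGURATION:":
            in_block = True
            continue
        if not in_block or not line or set(line) == {"="}:
            continue  # outside the block, blank line, or "=====" ruler
        if line.endswith(":"):
            break  # next section header, e.g. "CUFILE CONFIGURATION:"
        name, sep, status = line.partition(":")
        if sep:
            # Sub-entries carry leading dashes; drop them from the key.
            key = name.strip().lstrip("-\u2013").strip()
            entries[key] = status.strip()
    return entries

def lists_filesystem(entries, name):
    """True if any entry name contains `name`, ignoring case."""
    return any(name.lower() in key.lower() for key in entries)

# Excerpt copied from the output above.
SAMPLE = """=====================
DRIVER CONFIGURATION:
=====================
NVMe : Supported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
"""
entries = driver_config(SAMPLE)
print(lists_filesystem(entries, "beegfs"), entries)
```

To run it against the real output, save the output of gdscheck.py -p to a file and pass the file contents to driver_config.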
It may also be worth noting that the beegfs module is loaded. I checked this by running gdscheck.py -V, which performs the filesystem check.
FILESYSTEM VERSION CHECK:
Pre-requisite:
nvidia_peermem is loaded as required
nvme module is loaded
nvme module is correctly patched
nvme-rdma module is loaded
nvme-rdma module is correctly patched
ScaleFlux module is not loaded
NVMesh module is not loaded
Lustre module is not loaded
BeeGFS module is loaded
BeeGFS module is correctly patched
GPFS module is not loaded
rpcrdma module is not loaded
ofed_info:
current version: MLNX_OFED_LINUX-5.5-1.0.3.2: (Supported)
min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
I’m anxious to find out why Beegfs does not appear at all in the platform check (gdscheck.py -p) output. Let me know if I need to supply more information about my system or more context on the issue.
Thanks!
Dan