Getting BeeGFS to show up in gdscheck.py platform check

Problem:

I cannot get GDS configured with BeeGFS support. Specifically, the "BeeGFS: Supported/Unsupported" field does not show up when I run the gdscheck.py script, even after following the instructions in the NVIDIA GPUDirect Storage Installation and Troubleshooting Guide, which outlines the procedure for installing GDS and getting it to work with BeeGFS.

Context:

The system I am building is part of an academic research project that requires the optimizations offered by GDS in conjunction with BeeGFS. I am configuring GDS to work with the BeeGFS client service on an AMD HPC node with the following specifications, drivers, and BeeGFS packages:

              OS: RHEL 8.5, Kernel: Linux 4.18.0-348.7.1.el8_5.x86_64
         NIC/HCA: Mellanox ConnectX-5
     OFED Driver: MLNX_OFED_LINUX-5.5-1.0.3.2 (OFED-5.5-1.0.3)
  Subnet manager: opensm-5.10.0
             GPU: NVIDIA A10
       nvidia-fs: nvidia-fs-2.16.1 (installed via .run file)
   NVIDIA driver: nvidia-535.54.03 (installed via .run file)
    CUDA toolkit: cuda-12.2 (installed via .run file)
          BeeGFS: beegfs-client-20:7.3.3-el8.noarch (installed via yum)

Following the BeeGFS quick-start guide, I am running the management/storage/metadata services with RDMA over InfiniBand. I verified that this configuration was working before doing anything GDS related by running the beegfs-net command, which yields the following:

mgmt_nodes
=============
HPC25-ALPACA [ID: 1]
   Connections: TCP: 1 192.168.11.25:8008;

meta_nodes
=============
HPC25-ALPACA [ID: 25]
   Connections: RDMA: 1 192.168.11.25:8005;

storage_nodes
=============
HPC25-ALPACA [ID: 2525]
   Connections: RDMA: 1 192.168.11.25:8003;

After completing the steps in the BeeGFS GDS support document, I still do not get the "BeeGFS: Supported" or "BeeGFS: Unsupported" field that is supposed to show up when I run the gdscheck.py script. (This Python script ends up calling the gdscheck executable, for which I cannot find source files anywhere.) The buildArgs in my beegfs-autobuild.conf file are:

buildArgs=-j8 OFED_INCLUDE_PATH=/usr/src/ofa_kernel/default/include NVFS_INCLUDE_PATH=/usr/src/nvidia-fs-2.16.1 NVIDIA_INCLUDE_PATH=/usr/src/nvidia-535.54.03/nvidia

I have verified that nvfs_dma.h and config-host.h are in the NVFS_INCLUDE_PATH and that nv-p2p.h is in the NVIDIA_INCLUDE_PATH.
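For reference, after changing the build args I force a rebuild and reload of the client module before re-running the checks. This is only a sketch of the steps I use, assuming the stock BeeGFS client init script and systemd service names on RHEL; adjust paths and names to your setup:

 # rebuild the BeeGFS client kernel module after editing the autobuild conf
 /etc/init.d/beegfs-client rebuild

 # reload the client so the freshly built module is the one in use
 systemctl restart beegfs-client

 # sanity check: confirm which module file is loaded and that it is present
 modinfo beegfs | grep -E 'filename|version'
 lsmod | grep beegfs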

The output of gdscheck.py -p on my machine is below.

 GDS release version: 1.7.0.149
 nvidia_fs version:  2.16 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Supported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 1024
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 18014398509481980
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 0
 execution.max_io_queue_depth : 128
 execution.parallel_io : false
 execution.min_io_threshold_size_kb : 1024
 execution.max_request_parallelism : 0
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A10 bar:1 bar size (MiB):32768 supports GDS, IOMMU State: Disabled
 GPU index 1 NVIDIA A10 bar:1 bar size (MiB):32768 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Platform verification succeeded

It may also be worth noting that the BeeGFS module is loaded. I checked this by running gdscheck.py -V, which is the filesystem check:

 FILESYSTEM VERSION CHECK:
 Pre-requisite:
 nvidia_peermem is loaded as required
 nvme module is loaded
 nvme module is correctly patched
 nvme-rdma module is loaded
 nvme-rdma module is correctly patched
 ScaleFlux module is not loaded
 NVMesh module is not loaded
 Lustre module is not loaded
 BeeGFS module is loaded
 BeeGFS module is correctly patched
 GPFS module is not loaded
 rpcrdma module is not loaded
 ofed_info:
 current version: MLNX_OFED_LINUX-5.5-1.0.3.2: (Supported)
 min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1

I’m eager to find out why BeeGFS is not appearing in the platform check (gdscheck.py -p) output at all. Let me know if I need to supply any more information about my system or provide more context for the issue.

Thanks!

Dan

We are addressing this issue in an upcoming patch release. In the meantime, you can use the previous libcufile version, 1.6.0.25.

Thank you for the quick response. In an effort to go back to libcufile 1.6.0.25, I uninstalled CUDA toolkit 12.2 and installed CUDA toolkit 12.1. This did fix the issue: BeeGFS now shows up, and it is reported as supported! However, my libcufile version says 2.12, the GDS release version is 1.6.1.9, and the nvidia_fs version is 2.16. I thought I had uninstalled everything properly by following the steps provided at the end of the .run file installation (/usr/local/cuda-12.2/bin/cuda-uninstaller, nvidia-uninstall, and /usr/local/kernelobjects/bin/ko-uninstaller), but the versions that show up when running gdscheck.py -p don’t seem to match those in the CUDA toolkit 12.1 installation. Here is the output of gdscheck.py -p:

GDS release version: 1.6.1.9
 nvidia_fs version:  2.16 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Unsupported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Supported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 1024
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 18014398509481980
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 32 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 0
 execution.max_io_queue_depth : 128
 execution.parallel_io : false
 execution.min_io_threshold_size_kb : 1024
 execution.max_request_parallelism : 0
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A10 bar:1 bar size (MiB):32768 supports GDS, IOMMU State: Disabled
 GPU index 1 NVIDIA A10 bar:1 bar size (MiB):32768 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Platform verification succeeded

I just want to make sure I don’t have residual issues, since it looks like libcufile stayed the same. Did I miss something when uninstalling the old toolkit/NVIDIA drivers? I ran a quick read test against the BeeGFS mount and it passes read verification, so perhaps this isn’t something to worry about, since GDS seems to be working properly with BeeGFS now?
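For reference, the leftover-library check and the read test I ran were roughly the following; the gdsio flags are from my notes and the BeeGFS mount point (/mnt/beegfs) is specific to my system, so treat this as a sketch and check gdsio -h for the exact options on your install:

 # look for leftover libcufile copies from the old toolkit
 ldconfig -p | grep cufile
 ls -l /usr/local/cuda*/targets/x86_64-linux/lib/libcufile*

 # write a 1 GiB test file through GDS (-x 0), then read it back with verification
 /usr/local/cuda/gds/tools/gdsio -f /mnt/beegfs/gds_test -d 0 -w 4 -s 1G -i 1M -x 0 -I 1
 /usr/local/cuda/gds/tools/gdsio -f /mnt/beegfs/gds_test -d 0 -w 4 -s 1G -i 1M -x 0 -I 0 -V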

Thanks!

This is okay; the libcufile version shown is an internal version. It is bumped when there is a compatibility change with nvidia-fs, and it indicates the minimum nvidia_fs version that is supported.
