I have configured GDS and BeeGFS following the official NVIDIA documentation. The verification script (gdscheck.py) reports BeeGFS as supported, but when I write a file to a directory on the BeeGFS mount, cuFileHandleRegister fails with error code 5030 ("internal error").
The same sample writes successfully to the NVMe device. My environment information and the steps I ran are below.
Can anyone help? Thank you very much!
@sougupta
[root@orcafs19141 samples]# df -hT
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 63G 0 63G 0% /dev
tmpfs tmpfs 63G 0 63G 0% /dev/shm
tmpfs tmpfs 63G 34M 63G 1% /run
tmpfs tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/mapper/cl_orcafs-root xfs 70G 39G 32G 56% /
/dev/sda1 xfs 1014M 268M 747M 27% /boot
tmpfs tmpfs 13G 0 13G 0% /run/user/0
/dev/nvme0n1 ext4 916G 140M 870G 1% /mnt/nvme
orcafs_nodev beegfs 2.8T 20G 2.8T 1% /mnt/orcafs
[root@orcafs19141 samples]# /usr/local/cuda-12.1/gds/tools/gdscheck.py -p
GDS release version: 1.6.1.9
nvidia_fs version: 2.15 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
CUFILE_ENV_PATH_JSON : /root/workspace/GDS/cufile.json
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Supported
NVMeOF : Supported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Supported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 4
fs.beegfs.rdma_dev_addr_list : 192.168.20.141 192.168.20.142
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 0
execution.max_io_queue_depth : 128
execution.parallel_io : false
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 0
=========
GPU INFO:
=========
GPU index 0 Tesla P4 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Platform verification succeeded
[root@orcafs19141 samples]# /usr/local/cuda-12.1/gds/samples/cufile_sample_001 /mnt/nvme/testGPUx 0
opening file /mnt/nvme/testGPUx
registering device memory of size :131072
writing from device memory
deregistering device memory
[root@orcafs19141 samples]# /usr/local/cuda-12.1/gds/samples/cufile_sample_001 /mnt/orcafs/data/testGPUx 0
opening file /mnt/orcafs/data/testGPUx
file register error:internal error
file register error code: 5030
cat cufile.log
12-05-2023 10:50:16:462 [pid=339589 tid=339589] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:46
12-05-2023 10:50:16:462 [pid=339589 tid=339589] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:46
12-05-2023 10:50:16:462 [pid=339589 tid=339589] DEBUG cufio:1137 cuFile DIO status for file descriptor 45 DirectIO not supported
12-05-2023 10:50:16:462 [pid=339589 tid=339589] NOTICE cufio:1546 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
12-05-2023 10:50:16:463 [pid=339589 tid=339589] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:46
12-05-2023 10:50:16:463 [pid=339589 tid=339589] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:46
12-05-2023 10:50:16:463 [pid=339589 tid=339589] ERROR cufio-obj:177 unable to get volume attributes for fd 45
12-05-2023 10:50:16:463 [pid=339589 tid=339589] ERROR cufio:1564 cuFileHandleRegister error, failed to allocate file object
12-05-2023 10:50:16:463 [pid=339589 tid=339589] ERROR cufio:1592 cuFileHandleRegister error: internal error
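
To cross-check the two things the log complains about (the device number `0:46` and "DirectIO not supported"), a small script along these lines could be run against both mounts. This is only a diagnostic sketch, not part of the GDS tooling: the helper names `device_number` and `supports_o_direct` are made up for illustration, it assumes Linux and Python 3, and it only shows what the kernel reports for the path, not what cuFile does internally.

```python
import os
import tempfile


def device_number(path):
    """Return (major, minor) of the device backing `path`, as reported by stat()."""
    st = os.stat(path)
    return os.major(st.st_dev), os.minor(st.st_dev)


def supports_o_direct(directory):
    """Create a scratch file in `directory` and try to reopen it with O_DIRECT.

    Returns True if the open succeeds, False if the kernel rejects it.
    The scratch file is always removed afterwards.
    """
    fd, scratch = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        fd = os.open(scratch, os.O_RDWR | os.O_DIRECT)  # O_DIRECT is Linux-only
    except OSError:
        return False
    else:
        os.close(fd)
        return True
    finally:
        os.unlink(scratch)


if __name__ == "__main__":
    # Paths taken from the df output above; skipped if they do not exist.
    for mount in ("/mnt/nvme", "/mnt/orcafs/data"):
        if os.path.isdir(mount):
            major, minor = device_number(mount)
            print(f"{mount}: dev_no {major}:{minor}, "
                  f"O_DIRECT open ok: {supports_o_direct(mount)}")
```

If the BeeGFS path reports major number 0, that would presumably match the `dev_no: 0:46` in the log, i.e. there is no real block device behind the mount for udev to look up.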