I have configured GDS and BeeGFS following NVIDIA's official documentation. The verification script reports that BeeGFS is supported, but when I write files to the directory where BeeGFS is mounted, cuFileHandleRegister returns error code 5030, which means "internal error".
Writing to an NVMe device with the same method succeeds. Here are my environment information and the steps I ran.
Can anyone help me? Thank you very much!
@sougupta
[root@orcafs19141 samples]# df -hT
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 63G 0 63G 0% /dev
tmpfs tmpfs 63G 0 63G 0% /dev/shm
tmpfs tmpfs 63G 34M 63G 1% /run
tmpfs tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/mapper/cl_orcafs-root xfs 70G 39G 32G 56% /
/dev/sda1 xfs 1014M 268M 747M 27% /boot
tmpfs tmpfs 13G 0 13G 0% /run/user/0
/dev/nvme0n1 ext4 916G 140M 870G 1% /mnt/nvme
orcafs_nodev beegfs 2.8T 20G 2.8T 1% /mnt/orcafs
[root@orcafs19141 samples]# /usr/local/cuda-12.1/gds/tools/gdscheck.py -p
GDS release version: 1.6.1.9
nvidia_fs version: 2.15 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
CUFILE_ENV_PATH_JSON : /root/workspace/GDS/cufile.json
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Supported
NVMeOF : Supported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Supported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 4
fs.beegfs.rdma_dev_addr_list : 192.168.20.141 192.168.20.142
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 0
execution.max_io_queue_depth : 128
execution.parallel_io : false
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 0
=========
GPU INFO:
=========
GPU index 0 Tesla P4 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Platform verification succeeded
[root@orcafs19141 samples]# /usr/local/cuda-12.1/gds/samples/cufile_sample_001 /mnt/nvme/testGPUx 0
opening file /mnt/nvme/testGPUx
registering device memory of size :131072
writing from device memory
deregistering device memory
[root@orcafs19141 samples]# /usr/local/cuda-12.1/gds/samples/cufile_sample_001 /mnt/orcafs/data/testGPUx 0
opening file /mnt/orcafs/data/testGPUx
file register error:internal error
file register error code: 5030
cat cufile.log
12-05-2023 10:50:16:462 [pid=339589 tid=339589] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:46
12-05-2023 10:50:16:462 [pid=339589 tid=339589] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:46
12-05-2023 10:50:16:462 [pid=339589 tid=339589] DEBUG cufio:1137 cuFile DIO status for file descriptor 45 DirectIO not supported
12-05-2023 10:50:16:462 [pid=339589 tid=339589] NOTICE cufio:1546 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
12-05-2023 10:50:16:463 [pid=339589 tid=339589] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:46
12-05-2023 10:50:16:463 [pid=339589 tid=339589] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:46
12-05-2023 10:50:16:463 [pid=339589 tid=339589] ERROR cufio-obj:177 unable to get volume attributes for fd 45
12-05-2023 10:50:16:463 [pid=339589 tid=339589] ERROR cufio:1564 cuFileHandleRegister error, failed to allocate file object
12-05-2023 10:50:16:463 [pid=339589 tid=339589] ERROR cufio:1592 cuFileHandleRegister error: internal error
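The udev errors in this log come from the mount's device number: network and userspace filesystems sit on an anonymous device with major number 0 (the "dev_no: 0:46" above), so there is no udev block-device entry for libcufile to look up. A quick, hedged way to inspect what a mount reports (the path is the one from this thread; substitute your own):

```shell
# Show the filesystem type and device number for a mount point.
# Network/userspace filesystems sit on an anonymous device with
# major number 0, which is the "dev_no: 0:46" seen in cufile.log.
mnt=${1:-/}                    # e.g. /mnt/orcafs in this thread
stat -f -c 'fs_type=%T' "$mnt"
stat -c 'device=%d' "$mnt"
```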
In fact, I was trying to add GDS support to my own distributed file system, and I masqueraded it as BeeGFS. I have found the answer to this question: I needed to change my file system's magic number to match the BeeGFS magic that nvidia-fs checks for. That problem is now fixed. The current situation is that our system can read and write files with GDS enabled, but when device memory is released (by calling cudaFree()), the CPU gets stuck and the system becomes unstable; it cannot recover unless it is rebooted. The kernel messages are below. Can someone provide some help? Thanks!
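The magic-number change mentioned here refers to the statfs(2) f_type value that nvidia-fs checks when classifying a mount; the BeeGFS client reports 0x19830326 in its sources. As a sketch, you can check what a mount actually reports with:

```shell
# Print the filesystem magic (f_type) the kernel reports for a mount.
# nvidia-fs recognizes BeeGFS by its magic value, 0x19830326; a
# filesystem masquerading as BeeGFS must report that same value.
mnt=${1:-/}                    # e.g. /mnt/orcafs in this thread
stat -f -c 'magic=0x%t type=%T' "$mnt"
```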
"watchdog: BUG: soft lockup - CPU#32 stuck for 22s! [cufile_sample_0:61865]"
The system call stack is as follows:
[Wed May 17 20:34:57 2023] rcu: INFO: rcu_sched self-detected stall on CPU
[Wed May 17 20:34:57 2023] rcu: 35-....: (59929 ticks this GP) idle=3fe/1/0x4000000000000002 softirq=8880/8889 fqs=14990
[Wed May 17 20:34:57 2023] (t=60000 jiffies g=126349 q=14148)
[Wed May 17 20:34:57 2023] NMI backtrace for cpu 35
[Wed May 17 20:34:57 2023] CPU: 35 PID: 21428 Comm: cufile_sample_0 Kdump: loaded Tainted: P OEL --------- -t-4.18.0-240.el8.x86_64 #1
[Wed May 17 20:34:57 2023] Hardware name: Supermicro SSG-2028R-NR48N/X10DSC+, BIOS 3.0a 02/09/2018
[Wed May 17 20:34:57 2023] Call Trace:
[Wed May 17 20:34:57 2023] <IRQ>
[Wed May 17 20:34:57 2023] dump_stack+0x5c/0x80
[Wed May 17 20:34:57 2023] nmi_cpu_backtrace.cold.6+0x13/0x4e
[Wed May 17 20:34:57 2023] ? lapic_can_unplug_cpu.cold.28+0x37/0x37
[Wed May 17 20:34:57 2023] nmi_trigger_cpumask_backtrace+0xde/0xe0
[Wed May 17 20:34:57 2023] rcu_dump_cpu_stacks+0x9c/0xca
[Wed May 17 20:34:57 2023] rcu_sched_clock_irq.cold.70+0x1b4/0x3b8
[Wed May 17 20:34:57 2023] ? tick_sched_do_timer+0x60/0x60
[Wed May 17 20:34:57 2023] ? tick_sched_do_timer+0x60/0x60
[Wed May 17 20:34:57 2023] update_process_times+0x24/0x50
[Wed May 17 20:34:57 2023] tick_sched_handle+0x22/0x60
[Wed May 17 20:34:57 2023] tick_sched_timer+0x37/0x70
[Wed May 17 20:34:57 2023] __hrtimer_run_queues+0x100/0x280
[Wed May 17 20:34:57 2023] hrtimer_interrupt+0x100/0x220
[Wed May 17 20:34:57 2023] smp_apic_timer_interrupt+0x6a/0x130
[Wed May 17 20:34:57 2023] apic_timer_interrupt+0xf/0x20
[Wed May 17 20:34:57 2023] </IRQ>
[Wed May 17 20:34:57 2023] RIP: 0010:nvfs_get_pages_free_callback+0x106/0x1e0 [nvidia_fs]
[Wed May 17 20:34:57 2023] Code: 47 20 00 00 00 00 4c 89 ff e8 e6 6f 59 df 48 85 db 74 4c 49 89 df 49 83 ef 18 74 43 48 8b 7b e8 48 8b 03 48 85 ff 75 a5 0f 0b <8b> 45 60 83 f8 08 75 f8 bf e3 53 00 00 e8 e8 1e bb df 48 8b 44 24
[Wed May 17 20:34:57 2023] RSP: 0018:ffff980acbeb3b48 EFLAGS: 00000293 ORIG_RAX: ffffffffffffff13
[Wed May 17 20:34:57 2023] RAX: 0000000000000005 RBX: ffff8bd77088b888 RCX: 0000000000000005
[Wed May 17 20:34:57 2023] RDX: 0000000000000006 RSI: 0000000000000001 RDI: ffff8bd7b22c3440
[Wed May 17 20:34:57 2023] RBP: ffff8bd7b22c3400 R08: 000000000000087a R09: 0000000000000007
[Wed May 17 20:34:57 2023] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8bd7b22c3440
[Wed May 17 20:34:57 2023] R13: ffff8bd7b24ffcc0 R14: ffff8bd77088b898 R15: ffff8bd74a0dad08
[Wed May 17 20:34:57 2023] ? nvfs_get_pages_free_callback+0x51/0x1e0 [nvidia_fs]
[Wed May 17 20:34:57 2023] ? os_acquire_spinlock+0xe/0x20 [nvidia]
[Wed May 17 20:34:57 2023] ? _nv040575rm+0x10/0x20 [nvidia]
[Wed May 17 20:34:57 2023] nv_p2p_mem_info_free_callback+0x15/0x30 [nvidia]
[Wed May 17 20:34:57 2023] _nv000082rm+0x59/0x130 [nvidia]
[Wed May 17 20:34:57 2023] ? _nv041401rm+0x1be/0x1d0 [nvidia]
[Wed May 17 20:34:57 2023] ? _nv043322rm+0x1f1/0x300 [nvidia]
[Wed May 17 20:34:57 2023] ? _nv012571rm+0x3dc/0x650 [nvidia]
[Wed May 17 20:34:57 2023] ? _nv041543rm+0x69/0xd0 [nvidia]
[Wed May 17 20:34:57 2023] ? _nv011145rm+0x86/0xa0 [nvidia]
[Wed May 17 20:34:57 2023] ? _nv000707rm+0x871/0xdb0 [nvidia]
[Wed May 17 20:34:57 2023] ? rm_ioctl+0x58/0xb0 [nvidia]
[Wed May 17 20:34:57 2023] ? nvidia_ioctl+0x1e7/0x7e0 [nvidia]
[Wed May 17 20:34:57 2023] ? nvidia_frontend_unlocked_ioctl+0x3a/0x50 [nvidia]
[Wed May 17 20:34:57 2023] ? do_vfs_ioctl+0xa4/0x640
[Wed May 17 20:34:57 2023] ? syscall_trace_enter+0x1d3/0x2c0
[Wed May 17 20:34:57 2023] ? ksys_ioctl+0x60/0x90
[Wed May 17 20:34:57 2023] ? __x64_sys_ioctl+0x16/0x20
[Wed May 17 20:34:57 2023] ? do_syscall_64+0x5b/0x1a0
[Wed May 17 20:34:57 2023] ? entry_SYSCALL_64_after_hwframe+0x65/0xca
You may want to try the latest nvidia-fs on GitHub.
It would be best to reach out to NVIDIA if you want to integrate with GDS.
Do you mean that if I want to integrate GDS into my file system, this only involves nvidia-fs and not CUDA? If so, I'll just look into nvidia-fs. I've actually done some of the work, just not all of it. The relevant repository I have been studying is the following:
GitHub - NVIDIA/gds-nvidia-fs: NVIDIA GPUDirect Storage Driver.
I found a way to solve my problem: I enabled the rdma_dynamic_routing option in /etc/cufile.json (set it to true), and now the program reads and writes normally and the hang no longer occurs. But I don't know why. My version information is below; I would be grateful if someone could provide some guidance, thank you.
GDS release version: 1.6.1.9
nvidia_fs version: 2.15 libcufile version: 2.12
Platform: x86_64
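For reference, the change amounts to flipping one key in /etc/cufile.json. A minimal fragment, assuming the stock file's layout (the rdma_dev_addr_list values are the ones from the gdscheck output in this thread):

```json
{
    "properties": {
        "rdma_dynamic_routing": true
    },
    "fs": {
        "beegfs": {
            "rdma_dev_addr_list": ["192.168.20.141", "192.168.20.142"]
        }
    }
}
```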
Hi Fanyuanli,
How did you do this integration work? We are also trying the same thing: supporting GDS in our in-house distributed filesystem.
We also masqueraded as the beegfs filesystem type, and cuFileHandleRegister and cuFileBufRegister now succeed. cuFileWrite succeeds as well, but it uses cuFile IO mode: POSIX, so the IO is not going through GDS. I wonder how you fixed this problem.
I would appreciate it if you could kindly give me some advice.
I am looking forward to your reply.
My email is jimhuaang@gmail.com.
Thanks a lot!
Yes, I have fixed my issue. It was because I had missed part of the GDS support for BeeGFS. If you work through it carefully, you can succeed.
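A note for anyone hitting the POSIX-fallback symptom described in this thread: with properties.use_compat_mode set to true (as in the gdscheck output earlier), cuFileHandleRegister silently falls back to POSIX read/write whenever GDS cannot be used. Turning compat mode off makes the register call fail with the underlying error instead, which is easier to debug. A hedged /etc/cufile.json fragment:

```json
{
    "properties": {
        "use_compat_mode": false,
        "force_compat_mode": false
    }
}
```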