GDS cufile-fs: cannot verify RAID members, RAID group size empty

Hey everyone,

After a dist-upgrade on a DGX A100 system, GDS seems to have problems registering a file handle via the cuFile API. For example, running a gdsio read benchmark on a file on the affected filesystem (which sits on a RAID0 device) fails:

gdsio -d 0 -w 16 -s 4G -i 1M -x 0 -I 0 -f <a large file>
file register error: GPUDirect Storage not supported on current file filename : ....

The same error also appears when using the cuFile API directly.
Both gdsio and the direct API calls worked fine with this file before the upgrade.
The cufile.log contains the following errors:

ERROR  cufio-fs:87 cannot verify RAID members, RAID group size empty
ERROR  cufio-fs:807 RAID member not supported by cuFile in RAID group : /dev/md1
NOTICE  cufio-fs:441 dumping volume attributes: DEVNAME:/dev/md1,ID_FS_TYPE:ext4,ID_FS_USAGE:filesystem,MD_LEVEL:raid0,ext4_journal_mode:ordered,fsid:84b295d89889a3050x,queue/logical_block_size:4096,
DEBUG  cufio_core:1192 cuFile DIO status for file descriptor 3 DirectIO not supported
ERROR  cufio:296 cuFileHandleRegister error, file checks failed
ERROR  cufio:338 cuFileHandleRegister error: GPUDirect Storage not supported on current file

For gdsio, the following transfer types (-x) produce this error: 0, 5, 6, 7. As expected, forcing compat mode for transfer type 0 does not produce the error (the benchmark runs through without problems).
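For reference, compat mode was forced via the cuFile config; only the relevant key is shown below, assuming the default config path /etc/cufile.json (all other properties left at their defaults):

 "properties": {
     "allow_compat_mode": true
 }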

Any idea what the issue might be? The mount options for this filesystem have (to my knowledge) not been changed before the occurrence of this error (see the volume attributes dump in the log above).
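For the record, this is how I checked the current mount options (the array is /dev/md1, per the log above):

 # mount point, source device, fs type and options of the md array
 findmnt -o TARGET,SOURCE,FSTYPE,OPTIONS /dev/md1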

We don't use the NVIDIA Open GPU Kernel Modules, so the package versions are still 2.17.3-1 for nvidia-fs
and 12.2.1-1 for nvidia-gds.

 GDS release version: 1.7.2.10
 nvidia_fs version: 2.17
 libcufile version: 2.12
 Platform: x86_64

Let me know if more information, or the full log file output, would help in identifying the problem.
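For instance, I could post what udev and mdadm report for the array members. These are the commands I would use (guessing from the log above that libcufile looks at the udev-exported MD_* properties):

 # udev's view of the array (MD_* keys imported from mdadm)
 udevadm info --query=property /dev/md1 | grep -E 'MD_LEVEL|MD_DEVICES|MD_DEVICE'
 # mdadm's own view of the array and its member devices
 sudo mdadm --detail --export /dev/md1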
Thanks in advance!

The only other post I found with this error:

This is a problem with the distro's default udev rule settings.

The problem seems to be a change in the udev rules shipped with the mdadm package.
In the following file, look for the line below and remove "--no-devices":

/lib/udev/rules.d/63-md-raid-arrays.rules

IMPORT{program}="/sbin/mdadm --detail --no-devices --export $devnode"
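After editing the rule, something like the following should apply the change without a reboot (just a sketch; the array here is /dev/md1 as in the log above, and a reboot works just as well):

 # reload udev rules and re-trigger the array so the MD_DEVICE_* member
 # properties get re-imported for /dev/md1
 sudo udevadm control --reload-rules
 sudo udevadm trigger --action=change /dev/md1
 # the member devices should now show up in the udev database
 udevadm info --query=property /dev/md1 | grep MD_DEVICE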

Thanks a lot, that seems to have fixed it!
