How do I use NVIDIA GDS with NVMe without compatibility mode?

Hello all,

Context: I am trying to use GDS for an application in which data is constantly written to a buffer on the GPU. To avoid an intermediate buffer on the CPU, or any other CPU involvement, I want to use GDS to store this data on an NVMe SSD in the same machine.

Problem: I can’t run the included samples in non-compatibility mode. I also don’t know whether all performance features are enabled.

When I run the samples, they seem to work fine until I turn off compatibility mode by changing

{
    "properties": {
        ...
        "allow_compat_mode": true -> false,
        ...
    }
}

in cufile.json.
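To confirm which value the library will actually read, a small script can load the file and print the setting. This is only a sketch: it assumes the config file may contain `//` line comments (which plain JSON parsers reject), so it strips them first. The helper names `strip_line_comments` and `compat_mode_allowed` are made up for illustration and are not part of any NVIDIA tool.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that sit outside JSON string literals."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                # Keep the escaped character so \" does not end the string.
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept.
            while i < len(text) and text[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def compat_mode_allowed(config_text: str):
    """Return properties.allow_compat_mode, or None if the key is absent."""
    config = json.loads(strip_line_comments(config_text))
    return config.get("properties", {}).get("allow_compat_mode")
```

It could be called as `compat_mode_allowed(open(path).read())`, where `path` is wherever your cufile.json lives (commonly /etc/cufile.json, but check your own installation).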

When running with "allow_compat_mode": true, cufile_sample_001 outputs a valid file (hexdump confirms that the file is all 0xab and the proper size) and the following is written to cufile.log:

 20-09-2022 01:10:20:839 [pid=26974 tid=26974] ERROR  cufio-topo-nvfs:79 pci device not present in topology device attribute table: 0000:06:00.0
 20-09-2022 01:10:20:839 [pid=26974 tid=26974] ERROR  cufio-topo-nvfs:79 pci device not present in topology device attribute table: 0000:06:00.0
 20-09-2022 01:10:20:840 [pid=26974 tid=26974] ERROR  cufio-topo-nvfs:79 pci device not present in topology device attribute table: 0000:06:00.0
 20-09-2022 01:10:20:842 [pid=26974 tid=26974] ERROR  cufio-fs:834 cuFile does not support file-system type: illegal fstype
 20-09-2022 01:10:20:842 [pid=26974 tid=26974] NOTICE  cufio-fs:408 dumping volume attributes: DEVNAME:/dev/nvme1n1,ID_FS_TYPE:illegal fstype,ID_FS_USAGE:filesystem,UDEV_MODULE:nvme,UDEV_PCI_BRIDGE:0000:00:01.2,device/transport:pcie,fsid:000x,numa_node:0,queue/logical_block_size:4096,wwid:eui.0025385711406d46,
 20-09-2022 01:10:20:842 [pid=26974 tid=26974] NOTICE  cufio:1036 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled

Standard output prints the following:

opening file /mnt/gds-data/test-file
registering device memory of size :131072
writing from device memory
written bytes :131072
deregistering device memory

However, when "allow_compat_mode": false is set and cufile_sample_001 is run, an empty file is created and
the following is written to cufile.log:

 20-09-2022 01:11:56:949 [pid=27360 tid=27360] ERROR  cufio-topo-nvfs:79 pci device not present in topology device attribute table: 0000:06:00.0
 20-09-2022 01:11:56:949 [pid=27360 tid=27360] ERROR  cufio-topo-nvfs:79 pci device not present in topology device attribute table: 0000:06:00.0
 20-09-2022 01:11:56:949 [pid=27360 tid=27360] ERROR  cufio-topo-nvfs:79 pci device not present in topology device attribute table: 0000:06:00.0
 20-09-2022 01:11:56:966 [pid=27360 tid=27360] ERROR  cufio-fs:834 cuFile does not support file-system type: illegal fstype
 20-09-2022 01:11:56:966 [pid=27360 tid=27360] NOTICE  cufio-fs:408 dumping volume attributes: DEVNAME:/dev/nvme1n1,ID_FS_TYPE:illegal fstype,ID_FS_USAGE:filesystem,UDEV_MODULE:nvme,UDEV_PCI_BRIDGE:0000:00:01.2,device/transport:pcie,fsid:000x,numa_node:0,queue/logical_block_size:4096,wwid:eui.0025385711406d46,
 20-09-2022 01:11:56:966 [pid=27360 tid=27360] ERROR  cufio:1039 cuFileHandleRegister error, file checks failed
 20-09-2022 01:11:56:966 [pid=27360 tid=27360] ERROR  cufio:1082 cuFileHandleRegister error: GPUDirect Storage not supported on current file

Stdout prints the following

opening file /mnt/gds-data/test-file
file register error:GPUDirect Storage not supported on current file

How do I get this to work in non-compatibility mode?

For extra information, the following is the output when running python gdscheck.py -p.

 GDS release version: 1.3.1.18
 nvidia_fs version:  2.12 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : false
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 32 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 Found ACS enabled for switch 0000:00:03.1
 IOMMU: disabled
 Platform verification succeeded

Hi,
I have the same problem. Could anyone share the solution for this issue?

@nnaza008 I am also facing the same problem. Has anyone found a solution?

@user99287 @nnaza008, I am stuck on the same issue.

I have opened another discussion on NVMe support and the errors reported with GDS.
Please have a look and share anything that could be useful for setting up NVMe correctly for GDS.
There is not much material available, so I am relying on the forums and the community for help :).

What filesystem is installed on the device /dev/nvme1n1?
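The cufile.log excerpts quoted in the question already carry a hint: the "dumping volume attributes" line includes an ID_FS_TYPE field, and in both runs it reads "illegal fstype". A rough Python sketch for pulling those fields out of such a line is below. It assumes the attribute values never contain commas, and `parse_volume_attributes` is a hypothetical helper name, not something shipped with GDS.

```python
def parse_volume_attributes(log_line: str) -> dict:
    """Split a 'dumping volume attributes' log line into a key/value dict."""
    marker = "dumping volume attributes:"
    _, found, tail = log_line.partition(marker)
    if not found:
        raise ValueError("not a volume-attributes log line")
    attrs = {}
    for field in tail.strip().split(","):
        if not field:
            continue  # the line ends with a trailing comma
        # Split on the first colon only; values such as PCI addresses contain colons.
        key, _, value = field.partition(":")
        attrs[key.strip()] = value.strip()
    return attrs
```

For the line from the log above, `parse_volume_attributes(line)["ID_FS_TYPE"]` gives `illegal fstype`, which suggests the library could not determine a filesystem type it recognizes for that volume.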

For a local NVMe device, GDS is only supported on ext4 mounted in ordered mode (data=ordered) and on XFS.
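One way to check this is to look up the mount entry for the target file in /proc/mounts. The sketch below takes the mounts text as a string so it can be tried on sample data. Two assumptions to be aware of: an ext4 entry with no explicit data= option is treated as using the default (ordered) journaling mode, and mount points with escaped characters (such as spaces) are not handled. `find_mount` and `gds_filesystem_ok` are illustrative names only.

```python
def find_mount(path: str, mounts_text: str):
    """Return (device, mount_point, fstype, options) for the mount holding path.

    The longest matching mount point wins; returns None if nothing matches.
    """
    best = None
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue
        device, mount_point, fstype, options = parts[:4]
        prefix = mount_point.rstrip("/") + "/"
        if (path.rstrip("/") + "/").startswith(prefix):
            # ">=" lets a later mount on the same mount point override an earlier one.
            if best is None or len(mount_point) >= len(best[1]):
                best = (device, mount_point, fstype, options.split(","))
    return best


def gds_filesystem_ok(fstype: str, options: list) -> bool:
    """Apply the rule from this thread: XFS, or ext4 in ordered mode."""
    if fstype == "xfs":
        return True
    if fstype == "ext4":
        modes = [o.split("=", 1)[1] for o in options if o.startswith("data=")]
        # Assumption: no explicit data= option means the ext4 default (ordered).
        return all(mode == "ordered" for mode in modes)
    return False
```

On a real system, something like `find_mount("/mnt/gds-data/test-file", open("/proc/mounts").read())` would return the entry to inspect. Passing an absolute, symlink-resolved path (for example via `os.path.realpath`) is advisable.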