Hello all,
Context: I am trying to use GDS for an application in which data is constantly written to a buffer on the GPU. To avoid an intermediate buffer on the CPU, or any other CPU involvement, I want to use GDS to store this data on an NVMe SSD in the same computer.
Problem: I can’t run the included samples in non-compatibility mode. I also don’t know whether all performance features are enabled.
When I run the samples, they seem to work fine until I turn off compatibility mode by changing
{
"properties": {
...
"allow_compat_mode": true -> false,
...
}
}
in cufile.json.
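For anyone who wants to reproduce the toggle, here is a rough sketch of how the flag could be flipped from a script. It edits the text with a regular expression rather than parsing it, on the assumption that cufile.json may contain // comments that a strict JSON parser would reject. The function name set_allow_compat_mode and the sample text are made up for illustration.

```python
import re

# Matches the key and its boolean value. The text is edited directly because
# cufile.json may contain // comments (assumption), which json.loads rejects.
_COMPAT_RE = re.compile(r'("allow_compat_mode"\s*:\s*)(true|false)')


def set_allow_compat_mode(config_text: str, enabled: bool) -> str:
    """Return config_text with every allow_compat_mode value set to `enabled`."""
    value = "true" if enabled else "false"
    new_text, count = _COMPAT_RE.subn(lambda m: m.group(1) + value, config_text)
    if count == 0:
        raise ValueError("allow_compat_mode not found in config text")
    return new_text


# Illustrative stand-in for the real file contents (not the full cufile.json).
sample_config = """{
    "properties": {
        // hypothetical comment line
        "allow_compat_mode": true,
    }
}"""

print(set_allow_compat_mode(sample_config, enabled=False))
```

To apply it to the real file, read the file into a string, pass it through the function, and write the result back. Everything other than the flag value is left untouched.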
When running with "allow_compat_mode": true,
cufile_sample_001 outputs a valid file (hexdump confirms that the file is all 0xab and the proper size), and the following is written to cufile.log:
20-09-2022 01:10:20:839 [pid=26974 tid=26974] ERROR cufio-topo-nvfs:79 pci device not present in topology device attribute table: 0000:06:00.0
20-09-2022 01:10:20:839 [pid=26974 tid=26974] ERROR cufio-topo-nvfs:79 pci device not present in topology device attribute table: 0000:06:00.0
20-09-2022 01:10:20:840 [pid=26974 tid=26974] ERROR cufio-topo-nvfs:79 pci device not present in topology device attribute table: 0000:06:00.0
20-09-2022 01:10:20:842 [pid=26974 tid=26974] ERROR cufio-fs:834 cuFile does not support file-system type: illegal fstype
20-09-2022 01:10:20:842 [pid=26974 tid=26974] NOTICE cufio-fs:408 dumping volume attributes: DEVNAME:/dev/nvme1n1,ID_FS_TYPE:illegal fstype,ID_FS_USAGE:filesystem,UDEV_MODULE:nvme,UDEV_PCI_BRIDGE:0000:00:01.2,device/transport:pcie,fsid:000x,numa_node:0,queue/logical_block_size:4096,wwid:eui.0025385711406d46,
20-09-2022 01:10:20:842 [pid=26974 tid=26974] NOTICE cufio:1036 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
Stdout prints the following
opening file /mnt/gds-data/test-file
registering device memory of size :131072
writing from device memory
written bytes :131072
deregistering device memory
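The hexdump check I did by hand can also be scripted. This is a minimal sketch, assuming the sample is expected to write 131072 bytes of 0xab (the size printed above); check_fill is a name I made up for illustration.

```python
def check_fill(path: str, expected_size: int = 131072, fill_byte: int = 0xAB) -> bool:
    """Return True if the file is exactly expected_size bytes, all equal to fill_byte."""
    needle = bytes([fill_byte])
    total = 0
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files do not need to fit in memory.
        while chunk := f.read(1 << 20):
            # Any byte that differs from fill_byte makes the count fall short.
            if chunk.count(needle) != len(chunk):
                return False
            total += len(chunk)
    return total == expected_size
```

For the run above the call would be check_fill("/mnt/gds-data/test-file"). It returns False for an empty file, because the size check fails.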
However, when "allow_compat_mode": false
is set and cufile_sample_001 is run, an empty file is created and
the following is written to cufile.log:
20-09-2022 01:11:56:949 [pid=27360 tid=27360] ERROR cufio-topo-nvfs:79 pci device not present in topology device attribute table: 0000:06:00.0
20-09-2022 01:11:56:949 [pid=27360 tid=27360] ERROR cufio-topo-nvfs:79 pci device not present in topology device attribute table: 0000:06:00.0
20-09-2022 01:11:56:949 [pid=27360 tid=27360] ERROR cufio-topo-nvfs:79 pci device not present in topology device attribute table: 0000:06:00.0
20-09-2022 01:11:56:966 [pid=27360 tid=27360] ERROR cufio-fs:834 cuFile does not support file-system type: illegal fstype
20-09-2022 01:11:56:966 [pid=27360 tid=27360] NOTICE cufio-fs:408 dumping volume attributes: DEVNAME:/dev/nvme1n1,ID_FS_TYPE:illegal fstype,ID_FS_USAGE:filesystem,UDEV_MODULE:nvme,UDEV_PCI_BRIDGE:0000:00:01.2,device/transport:pcie,fsid:000x,numa_node:0,queue/logical_block_size:4096,wwid:eui.0025385711406d46,
20-09-2022 01:11:56:966 [pid=27360 tid=27360] ERROR cufio:1039 cuFileHandleRegister error, file checks failed
20-09-2022 01:11:56:966 [pid=27360 tid=27360] ERROR cufio:1082 cuFileHandleRegister error: GPUDirect Storage not supported on current file
Stdout prints the following
opening file /mnt/gds-data/test-file
file register error:GPUDirect Storage not supported on current file
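Since several of the log lines repeat, here is a small sketch that groups cufile.log entries by source and message. The line layout in the regular expression is inferred only from the entries pasted above, so treat it as an assumption rather than a documented format; summarize is a hypothetical helper name.

```python
import re
from collections import Counter

# Layout inferred from the pasted log lines:
# DD-MM-YYYY HH:MM:SS:mmm [pid=N tid=N] LEVEL source:line message
LINE_RE = re.compile(
    r"^(?P<date>\d{2}-\d{2}-\d{4}) (?P<time>[\d:]+) "
    r"\[pid=(?P<pid>\d+) tid=(?P<tid>\d+)\] "
    r"(?P<level>[A-Z]+)\s+(?P<source>\S+)\s+(?P<message>.*)$"
)


def summarize(log_text: str, level: str = "ERROR") -> Counter:
    """Count (source, message) pairs for log lines at the given level."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group("level") == level:
            counts[(m.group("source"), m.group("message"))] += 1
    return counts


# ERROR lines copied from the non-compat run above.
sample_log = """\
20-09-2022 01:11:56:949 [pid=27360 tid=27360] ERROR cufio-topo-nvfs:79 pci device not present in topology device attribute table: 0000:06:00.0
20-09-2022 01:11:56:949 [pid=27360 tid=27360] ERROR cufio-topo-nvfs:79 pci device not present in topology device attribute table: 0000:06:00.0
20-09-2022 01:11:56:949 [pid=27360 tid=27360] ERROR cufio-topo-nvfs:79 pci device not present in topology device attribute table: 0000:06:00.0
20-09-2022 01:11:56:966 [pid=27360 tid=27360] ERROR cufio-fs:834 cuFile does not support file-system type: illegal fstype
20-09-2022 01:11:56:966 [pid=27360 tid=27360] ERROR cufio:1039 cuFileHandleRegister error, file checks failed
20-09-2022 01:11:56:966 [pid=27360 tid=27360] ERROR cufio:1082 cuFileHandleRegister error: GPUDirect Storage not supported on current file
"""

for (source, message), n in summarize(sample_log).items():
    print(f"{n}x {source}: {message}")
```

Run against both logs, this makes it easier to see that the topology and file-system errors appear in both runs, and only the cuFileHandleRegister lines differ.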
How do I get this to work in non-compatibility mode?
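Because both logs report "illegal fstype" for /dev/nvme1n1, it may help to show what file system the kernel itself reports for the mount. The sketch below looks a path up in a mount table in the format of /proc/mounts on Linux. The function name fstype_for and the sample table are hypothetical, and "somefs" is a placeholder, not the actual file system on my drive.

```python
import os
from typing import Optional, Tuple


def fstype_for(path: str, mounts_text: str) -> Optional[Tuple[str, str, str]]:
    """Return (device, mount_point, fstype) for the deepest mount containing path."""
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        device, mount_point, fstype = fields[:3]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on ties the later entry wins.
            if best is None or len(mount_point) >= len(best[1]):
                best = (device, mount_point, fstype)
    return best


# Hypothetical mount table in /proc/mounts format; "somefs" is a placeholder.
sample_mounts = """\
/dev/nvme0n1p2 / ext4 rw,relatime 0 0
/dev/nvme1n1 /mnt/gds-data somefs rw,relatime 0 0
"""

print(fstype_for("/mnt/gds-data/test-file", sample_mounts))
```

On the real machine the second argument would be the contents of /proc/mounts, for example open("/proc/mounts").read().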
For extra information, the following is the output of running python gdscheck.py -p:
GDS release version: 1.3.1.18
nvidia_fs version: 2.12 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Supported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : false
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
=========
GPU INFO:
=========
GPU index 0 NVIDIA RTX A5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
Found ACS enabled for switch 0000:00:03.1
IOMMU: disabled
Platform verification succeeded