Nvidia-fs MAP ioctl failed : ioctl_return: -22 ioctl_ret: -1 with GPU Direct Storage

I want to use the GPUDirect Storage (GDS) feature. When I run the sample cufile_sample_001 (MagnumIO/gds/samples at main · NVIDIA/MagnumIO · GitHub) with the command (sudo ./cufile_sample_001 /mnt/nvme/test.txt CUDA:0), I get the following error:

08-03-2024 01:29:42:0 [pid=21748 tid=21748] ERROR 0:501 nvidia-fs MAP ioctl failed : ioctl_return: -22 ioctl_ret: -1
08-03-2024 01:29:42:0 [pid=21748 tid=21748] ERROR 0:515 map failed

08-03-2024 01:29:42:0 [pid=21748 tid=21748] ERROR cufio-obj:129 error allocating nvfs handle, size: 131072
08-03-2024 01:29:42:0 [pid=21748 tid=21748] ERROR cufio_core:1589 cuFileBufRegister error, object allocation failed
08-03-2024 01:29:42:0 [pid=21748 tid=21748] ERROR cufio_core:1667 cuFileBufRegister error cufile success
08-03-2024 01:29:42:1 [pid=21748 tid=21748] ERROR 0:501 nvidia-fs MAP ioctl failed : ioctl_return: -22 ioctl_ret: -1
08-03-2024 01:29:42:1 [pid=21748 tid=21748] ERROR 0:515 map failed

08-03-2024 01:29:42:1 [pid=21748 tid=21748] ERROR 0:829 Buffer map failed for PCI-Group: 0 GPU: 0
08-03-2024 01:29:42:1 [pid=21748 tid=21748] ERROR 0:957 Failed to obtain bounce buffer from domain: 0 GPU: 0
08-03-2024 01:29:42:1 [pid=21748 tid=21748] ERROR 0:1234 failed to get bounce buffer for PCI group 0 GPU 0
08-03-2024 01:29:42:1 [pid=21748 tid=21748] ERROR cufio:145 cuFileBufDeregister error, object for device pointer is not registered
08-03-2024 01:29:42:1 [pid=21748 tid=21748] ERROR cufio:171 cuFileBufDeregister error: device pointer lookup failure

ofed_info -s
MLNX_OFED_LINUX-5.8-4.1.5.0:

#python3 /usr/local/cuda/gds/tools/gdscheck.py -p
GDS release version: 1.8.0.34
nvidia_fs version: 2.17 libcufile version: 2.12
Platform: x86_64

ENVIRONMENT:

=====================
DRIVER CONFIGURATION:

NVMe : Supported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0

CUFILE CONFIGURATION:

properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 4
execution.max_io_queue_depth : 128
execution.parallel_io : true
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 4
properties.force_odirect_mode : false
properties.prefer_iouring : false

GPU INFO:

GPU index 0 NVIDIA GeForce RTX 3070 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled

PLATFORM INFO:

IOMMU: disabled
Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
Cuda Driver Version Installed: 12030
Platform: SYS-7049GP-TRT, Arch: x86_64(Linux 5.15.0-100-generic)
Platform verification succeeded

Cuda Toolkit 12.3
nvidia-fs driver version 2.17.5
GPU: NVIDIA GeForce RTX 3070

For GeForce support, I ran the following command, as described in NVIDIA's CUDA Toolkit installation instructions:
echo "options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" | sudo tee /etc/modprobe.d/nvidia-gsp.conf

mount | grep ext4 | grep nvme
/dev/nvme1n1 on /mnt/nvme type ext4 (rw,relatime,data=ordered)

I have been stuck on this problem for a few days and could not solve it. Kindly help me. Let me know if you need any other information.

GDS P2P mode is supported only on data center GPUs (Tesla or Quadro models).
The GeForce RTX 3070 is not supported.


Later I installed GPUDirect Storage on a system with an A100 GPU. Is there any way to confirm that GDS is working properly? I ran the sample code and it ran fine, but how do I check that the GPU is communicating directly with the NVMe SSD? Is there any profiling mechanism?

You can enable the I/O counters as superuser:

echo 1 > /sys/module/nvidia_fs/parameters/rw_stats_enabled

Then you can run:
watch cat /proc/driver/nvidia-fs/stats

Please note that the stats can add overhead for smaller I/O sizes (< 32K).
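
To see which counters a single run actually moves, rather than watching the file by eye, one option is to snapshot the stats file before and after the sample and print only the lines that changed. The sketch below is a rough illustration, not an NVIDIA tool: the helper names changed_lines and run_and_diff are made up for this example, it assumes the counters were enabled as shown above, and it makes no assumption about the layout of the stats file beyond it being line-oriented text. One caveat: if properties.use_compat_mode is true, as in the gdscheck output earlier in this thread, cuFile can fall back to POSIX I/O, so a sample finishing without errors does not by itself show that the direct path was used.

```python
import subprocess
from pathlib import Path

# Path taken from the reply above; readable once nvidia-fs is loaded.
STATS_PATH = Path("/proc/driver/nvidia-fs/stats")


def changed_lines(before: str, after: str) -> list[tuple[str, str]]:
    """Return (old, new) pairs for lines that differ between two snapshots.

    Lines are compared position by position; if the snapshots have a
    different number of lines, the extra lines are ignored.
    """
    old = before.splitlines()
    new = after.splitlines()
    return [(o, n) for o, n in zip(old, new) if o != n]


def run_and_diff(cmd: list[str], stats_path: Path = STATS_PATH) -> list[tuple[str, str]]:
    """Snapshot the stats file, run cmd, snapshot again, and return the diff.

    Hypothetical helper for illustration; run it with the same privileges
    you use to run the sample itself.
    """
    before = stats_path.read_text()
    subprocess.run(cmd, check=True)  # raises CalledProcessError if cmd fails
    after = stats_path.read_text()
    return changed_lines(before, after)
```

For example, calling run_and_diff(["./cufile_sample_001", "/mnt/nvme/test.txt", "CUDA:0"]) and printing each returned pair shows the counters touched by that run. If no line changes, the I/O most likely did not go through nvidia-fs.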
