Nvidia-fs MAP ioctl failed : ioctl_return: -22 ioctl_ret: -1 with GPU Direct Storage

sabihaafroz · March 8, 2024, 2:50am

I want to use the GPUDirect Storage (GDS) feature, but when I run the sample cufile_sample_001 (MagnumIO/gds/samples at main · NVIDIA/MagnumIO · GitHub) with this command (sudo ./cufile_sample_001 /mnt/nvme/test.txt CUDA:0), I get the following error:

08-03-2024 01:29:42:0 [pid=21748 tid=21748] ERROR 0:501 nvidia-fs MAP ioctl failed : ioctl_return: -22 ioctl_ret: -1
08-03-2024 01:29:42:0 [pid=21748 tid=21748] ERROR 0:515 map failed

08-03-2024 01:29:42:0 [pid=21748 tid=21748] ERROR cufio-obj:129 error allocating nvfs handle, size: 131072
08-03-2024 01:29:42:0 [pid=21748 tid=21748] ERROR cufio_core:1589 cuFileBufRegister error, object allocation failed
08-03-2024 01:29:42:0 [pid=21748 tid=21748] ERROR cufio_core:1667 cuFileBufRegister error cufile success
08-03-2024 01:29:42:1 [pid=21748 tid=21748] ERROR 0:501 nvidia-fs MAP ioctl failed : ioctl_return: -22 ioctl_ret: -1
08-03-2024 01:29:42:1 [pid=21748 tid=21748] ERROR 0:515 map failed

08-03-2024 01:29:42:1 [pid=21748 tid=21748] ERROR 0:829 Buffer map failed for PCI-Group: 0 GPU: 0
08-03-2024 01:29:42:1 [pid=21748 tid=21748] ERROR 0:957 Failed to obtain bounce buffer from domain: 0 GPU: 0
08-03-2024 01:29:42:1 [pid=21748 tid=21748] ERROR 0:1234 failed to get bounce buffer for PCI group 0 GPU 0
08-03-2024 01:29:42:1 [pid=21748 tid=21748] ERROR cufio:145 cuFileBufDeregister error, object for device pointer is not registered
08-03-2024 01:29:42:1 [pid=21748 tid=21748] ERROR cufio:171 cuFileBufDeregister error: device pointer lookup failure
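
A note on the return code: the -22 in "ioctl_return: -22" looks like a negated Linux errno value, which would make it EINVAL (invalid argument). That reading is an assumption about how nvidia-fs reports failures; the log itself does not say so. A minimal Python sketch for decoding such values (decode_errno is a hypothetical helper name):

```python
import errno
import os


def decode_errno(ret):
    """Map a negative errno-style return code to (symbolic name, message).

    Assumes the code follows the kernel convention of returning -errno.
    """
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, message = decode_errno(-22)
print(f"{name}: {message}")
```

EINVAL only says the driver rejected the request as invalid; it does not say which argument or which hardware capability was the cause.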

ofed_info -s
MLNX_OFED_LINUX-5.8-4.1.5.0:

#python3 /usr/local/cuda/gds/tools/gdscheck.py -p
GDS release version: 1.8.0.34
nvidia_fs version: 2.17 libcufile version: 2.12
Platform: x86_64

ENVIRONMENT:

=====================
DRIVER CONFIGURATION:

NVMe : Supported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0

CUFILE CONFIGURATION:

properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 4
execution.max_io_queue_depth : 128
execution.parallel_io : true
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 4
properties.force_odirect_mode : false
properties.prefer_iouring : false

GPU INFO:

GPU index 0 NVIDIA GeForce RTX 3070 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled

PLATFORM INFO:

IOMMU: disabled
Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
Cuda Driver Version Installed: 12030
Platform: SYS-7049GP-TRT, Arch: x86_64(Linux 5.15.0-100-generic)
Platform verification succeeded
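
For anyone comparing several machines, the gdscheck report above can be reduced to the few fields that matter here. The sketch below assumes the layout shown in this post ("section.key : value" lines and a single "GPU index ..." line per GPU); other gdscheck versions may print things differently, and parse_gdscheck is a hypothetical helper, not part of the GDS tools.

```python
import re

# Dotted configuration keys such as "properties.use_compat_mode : true".
CONFIG_LINE = re.compile(r"^\s*([a-z_]+(?:\.[a-z_]+)+)\s*:\s*(.+?)\s*$")
# GPU summary line, in the form printed in the report above.
GPU_LINE = re.compile(r"GPU index (\d+) (.+?) bar:(\d+) bar size \(MiB\):(\d+)")


def parse_gdscheck(text):
    """Return (config dict, list of GPU dicts) from gdscheck -p output."""
    config, gpus = {}, []
    for line in text.splitlines():
        m = CONFIG_LINE.match(line)
        if m:
            config[m.group(1)] = m.group(2)
            continue
        m = GPU_LINE.search(line)
        if m:
            gpus.append({
                "index": int(m.group(1)),
                "name": m.group(2),
                "bar": int(m.group(3)),
                "bar_size_mib": int(m.group(4)),
            })
    return config, gpus


# Lines copied from the report in this post.
sample = """\
properties.use_compat_mode : true
fs.lustre.posix_gds_min_kb: 0
GPU index 0 NVIDIA GeForce RTX 3070 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
"""
config, gpus = parse_gdscheck(sample)
print(config["properties.use_compat_mode"], gpus[0]["name"], gpus[0]["bar_size_mib"])
```

In practice the text would come from running gdscheck.py -p and capturing its output, for example with subprocess.run(..., capture_output=True, text=True).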

Cuda Toolkit 12.3
nvidia-fs driver version 2.17.5
GPU: NVIDIA GeForce RTX 3070

For GeForce support, I ran the following command, as described in the CUDA Toolkit installation instructions provided by NVIDIA:
echo "options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" | sudo tee /etc/modprobe.d/nvidia-gsp.conf

mount | grep ext4 | grep nvme
/dev/nvme1n1 on /mnt/nvme type ext4 (rw,relatime,data=ordered)

I have been stuck on this problem for a few days and have not been able to solve it. Kindly help me. Let me know if you need any other information.

kmodukuri · March 21, 2024, 3:46pm

GDS P2P mode is supported only on data center GPUs, i.e. Tesla or Quadro models.
The RTX 3070 is not supported.

sabihaafroz · March 22, 2024, 3:44pm

Later I installed GPUDirect Storage on a machine with an A100 GPU. Is there any way to confirm that GDS is working properly? I ran the sample codes and they ran fine, but how do I check that the GPU is communicating directly with the NVMe SSD? Is there any profiling mechanism?

kmodukuri · March 22, 2024, 5:23pm

You can enable the IO counters as superuser:

echo 1 > /sys/module/nvidia_fs/parameters/rw_stats_enabled

Then you can run:
watch cat /proc/driver/nvidia-fs/stats

Please note that enabling stats can add overhead for smaller IO sizes (< 32K).
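
As an alternative to watching the file by eye, the stats can be snapshotted before and after a workload and compared. The sketch below makes no assumption about the layout of /proc/driver/nvidia-fs/stats beyond it being line-oriented text; changed_lines and diff_around are hypothetical helper names, and deciding which of the changed counters indicate direct I/O is left to the reader.

```python
from pathlib import Path

# Path taken from the reply above.
STATS_PATH = Path("/proc/driver/nvidia-fs/stats")


def changed_lines(before, after):
    """Return (old, new) pairs for lines that differ between two snapshots.

    Lines are compared position by position, so this assumes both
    snapshots have the same number of lines in the same order.
    """
    return [(old, new)
            for old, new in zip(before.splitlines(), after.splitlines())
            if old != new]


def diff_around(workload, path=STATS_PATH):
    """Snapshot the stats file, run workload(), and report what changed."""
    before = Path(path).read_text()
    workload()
    after = Path(path).read_text()
    return changed_lines(before, after)
```

A possible use, run as root after enabling rw_stats_enabled: pass a function that launches the GDS sample (for example via subprocess.run) to diff_around and print the returned pairs. If nothing changes, the I/O probably did not go through the nvidia-fs driver.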
