Issues Running GPUDirect Storage Benchmark on a Single-GPU Setup with CUDA 12.6

Hello everyone,

I’m encountering an issue while running benchmarks with GPUDirect Storage (GDS).

The benchmark I’m using is from this repository, and I’m running it via the Docker setup provided there.

Since I only have a single GPU, I set num_gpu to 1 in run_benchmark.sh.

I checked the kernel log on the GPU node using dmesg:

$ dmesg
[249520.085383] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.085413] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
                 va_start=0x408c9e00000/va_end=0x408c9ffffff/rounded_size=0x200000/gpu_buf_length=0x200000
[249520.086320] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.086350] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
                 va_start=0x409ca000000/va_end=0x409ca1fffff/rounded_size=0x200000/gpu_buf_length=0x200000
[249520.087328] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.087364] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
                 va_start=0x40a4a200000/va_end=0x40a4a3fffff/rounded_size=0x200000/gpu_buf_length=0x200000
[249520.088303] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.088337] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
                 va_start=0x40a8a400000/va_end=0x40a8a5fffff/rounded_size=0x200000/gpu_buf_length=0x200000
[249520.089408] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.089443] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
                 va_start=0x40aaa600000/va_end=0x40aaa7fffff/rounded_size=0x200000/gpu_buf_length=0x200000
[249520.090361] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.090396] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
                 va_start=0x40aab800000/va_end=0x40aab9fffff/rounded_size=0x200000/gpu_buf_length=0x200000

In the container’s cufile.log, I also noticed the following errors:

28-10-2024 06:13:59:949 [pid=1064 tid=1118] ERROR  cufio-obj:101 error allocating nvfs handle, size: 2097152
28-10-2024 06:13:59:949 [pid=1064 tid=1118] ERROR  cufio:1170 cuFileBufRegister error, object allocation failed
28-10-2024 06:13:59:949 [pid=1064 tid=1118] ERROR  cufio:1218 cuFileBufRegister error Failed to allocate pinned GPU Memory
28-10-2024 06:13:59:950 [pid=1064 tid=1118] ERROR  0:491 nvidia-fs MAP ioctl failed : -1
28-10-2024 06:13:59:950 [pid=1064 tid=1118] ERROR  0:506 map failed

Does anyone have any suggestions? I’m happy to provide more information. Thank you all!

Here is my server information:

OS: Ubuntu 20.04

$ nvidia-smi
Mon Oct 28 14:02:03 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 5000                Off |   00000001:9B:00.0 Off |                  Off |
| 33%   27C    P8              2W /  230W |       6MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

I also ran gdscheck.py to verify the environment status.

$ python3 /usr/local/cuda-12.6/gds/tools/gdscheck.py -p
 GDS release version: 1.11.1.6
 nvidia_fs version:  2.22 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Supported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 Quadro RTX 5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  12060
 Platform: KS 5000U, Arch: x86_64(Linux 5.15.0-124-generic)
 Platform verification succeeded

Running gdsio works without any issues, and there are no errors in cufile.log.

./gdsio -f {NVMe_Path}/testfile -d 0 -w 16 -s 20G -i 1M -x 0 -I 1
IoType: WRITE XferType: GPUD Threads: 16 DataSetSize: 19604480/20971520(KiB) IOSize: 1024(KiB) Throughput: 8.830188 GiB/sec, Avg_Latency: 1772.514806 usecs ops: 19145 total_time 2.117315 secs

Hi Barry Cheng,

The Quadro RTX 5000 has only 256 MiB of BAR1 memory, which limits how much GPU memory can be registered (pinned) for GDS on this GPU.

The deepcam inference benchmark appears to need more BAR1 registrations than that with your current settings.

Check this page for the relevant tunables:

https://docs.nvidia.com/deeplearning/dali/user-guide/docs/operations/nvidia.dali.fn.readers.numpy.html

Skip registration of buffers:
register_buffers=False

Alternatively, tune GDS_CHUNK_SIZE to be smaller so that the total BAR1 memory used stays under 220 MiB.
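As a minimal sketch (not the benchmark’s actual pipeline; file_root, batch size, and thread count here are placeholders), the flag is passed to the DALI numpy reader like this:

from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=8, num_threads=4, device_id=0)
def numpy_reader_pipeline(file_root):
    # device="gpu" selects the GDS (cuFile) read path;
    # register_buffers=False skips cuFileBufRegister, so no BAR1 space is pinned.
    data = fn.readers.numpy(
        device="gpu",
        file_root=file_root,
        register_buffers=False,
    )
    return data

pipe = numpy_reader_pipeline("/path/to/npy/dir")
pipe.build()
out, = pipe.run()

Skipping registration generally trades some throughput for staying out of BAR1, so trying a smaller chunk size first may be the lighter-weight fix.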


Hi kmodukuri,

Wow, that’s awesome! I tried changing GDS_CHUNK_SIZE to 1M, and it worked perfectly.

export DALI_GDS_CHUNK_SIZE=1M

Thanks a lot for your help!
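
In case it helps anyone else, the same tuning can also be applied from the Python side before the DALI pipeline is built (a sketch; exporting the variable in the shell, as above, works the same way):

import os

# Set the DALI GDS chunk size to 1 MiB before any DALI pipeline is constructed,
# so the GDS backend picks the value up when it initializes.
os.environ["DALI_GDS_CHUNK_SIZE"] = "1M"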
