Issues Running GPUDirect Storage Benchmark on a Single-GPU Setup with CUDA 12.6

Hello everyone,

I’m encountering an issue while running benchmarks with GPUDirect Storage (GDS).

The benchmark I’m using is from this repository, and I’m running it via the Docker setup provided there.

Since I only have a single GPU, I set num_gpu to 1 in run_benchmark.sh.

I checked the kernel log on the GPU node using dmesg:

$ dmesg
[249520.085383] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.085413] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
                 va_start=0x408c9e00000/va_end=0x408c9ffffff/rounded_size=0x200000/gpu_buf_length=0x200000
[249520.086320] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.086350] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
                 va_start=0x409ca000000/va_end=0x409ca1fffff/rounded_size=0x200000/gpu_buf_length=0x200000
[249520.087328] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.087364] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
                 va_start=0x40a4a200000/va_end=0x40a4a3fffff/rounded_size=0x200000/gpu_buf_length=0x200000
[249520.088303] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.088337] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
                 va_start=0x40a8a400000/va_end=0x40a8a5fffff/rounded_size=0x200000/gpu_buf_length=0x200000
[249520.089408] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.089443] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
                 va_start=0x40aaa600000/va_end=0x40aaa7fffff/rounded_size=0x200000/gpu_buf_length=0x200000
[249520.090361] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.090396] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
                 va_start=0x40aab800000/va_end=0x40aab9fffff/rounded_size=0x200000/gpu_buf_length=0x200000

In the container’s cufile.log, I also noticed the following errors:

28-10-2024 06:13:59:949 [pid=1064 tid=1118] ERROR  cufio-obj:101 error allocating nvfs handle, size: 2097152
28-10-2024 06:13:59:949 [pid=1064 tid=1118] ERROR  cufio:1170 cuFileBufRegister error, object allocation failed
28-10-2024 06:13:59:949 [pid=1064 tid=1118] ERROR  cufio:1218 cuFileBufRegister error Failed to allocate pinned GPU Memory
28-10-2024 06:13:59:950 [pid=1064 tid=1118] ERROR  0:491 nvidia-fs MAP ioctl failed : -1
28-10-2024 06:13:59:950 [pid=1064 tid=1118] ERROR  0:506 map failed

Does anyone have any suggestions? I’m happy to provide more information. Thank you all!

Here is my server information:

OS: Ubuntu 20.04

$ nvidia-smi
Mon Oct 28 14:02:03 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 5000                Off |   00000001:9B:00.0 Off |                  Off |
| 33%   27C    P8              2W /  230W |       6MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

I also ran gdscheck.py to verify the environment status.

$ python3 /usr/local/cuda-12.6/gds/tools/gdscheck.py -p
 GDS release version: 1.11.1.6
 nvidia_fs version:  2.22 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Supported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 Quadro RTX 5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  12060
 Platform: KS 5000U, Arch: x86_64(Linux 5.15.0-124-generic)
 Platform verification succeeded

Running gdsio works without any issues, and there are no errors in cufile.log.

./gdsio -f {NVMe_Path}/testfile -d 0 -w 16 -s 20G -i 1M -x 0 -I 1
IoType: WRITE XferType: GPUD Threads: 16 DataSetSize: 19604480/20971520(KiB) IOSize: 1024(KiB) Throughput: 8.830188 GiB/sec, Avg_Latency: 1772.514806 usecs ops: 19145 total_time 2.117315 secs

Hi Barry Cheng,

The Quadro RTX 5000 has only 256 MiB of BAR1 memory, which limits how much GPU memory can be registered (pinned) for GDS on this GPU.

The deepcam inference benchmark appears to need more BAR1 registrations than that with your current settings.

Check this page for the relevant tunables:

https://docs.nvidia.com/deeplearning/dali/user-guide/docs/operations/nvidia.dali.fn.readers.numpy.html

Skip registration of buffers:
register_buffers=False

Alternatively, tune GDS_CHUNK_SIZE to be smaller so that the total BAR1 memory used stays under 220 MiB.
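As a minimal sketch (not the benchmark’s actual pipeline; file_root, batch size, and thread count here are placeholders), the flag is passed to the DALI numpy reader like this:

from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=8, num_threads=4, device_id=0)
def numpy_reader_pipeline(file_root):
    # device="gpu" selects the GDS (cuFile) read path;
    # register_buffers=False skips cuFileBufRegister, so no BAR1 space is pinned.
    data = fn.readers.numpy(
        device="gpu",
        file_root=file_root,
        register_buffers=False,
    )
    return data

pipe = numpy_reader_pipeline("/path/to/npy/dir")
pipe.build()
out, = pipe.run()

Skipping registration generally trades some throughput for staying out of BAR1, so trying a smaller chunk size first may be the lighter-weight fix.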


Hi kmodukuri,

Wow, that’s awesome! I tried changing GDS_CHUNK_SIZE to 1M, and it worked perfectly.

export DALI_GDS_CHUNK_SIZE=1M

Thanks a lot for your help!
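
In case it helps anyone else, the same tuning can also be applied from the Python side before the DALI pipeline is built (a sketch; exporting the variable in the shell, as above, works the same way):

import os

# Set the DALI GDS chunk size to 1 MiB before any DALI pipeline is constructed,
# so the GDS backend picks the value up when it initializes.
os.environ["DALI_GDS_CHUNK_SIZE"] = "1M"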
