Hello everyone,
I’m encountering an issue while running benchmarks with GPUDirect Storage (GDS).
The benchmark I’m using is from this repository, and I’m running it via the Docker setup provided there.
Since I only have a single GPU, I set num_gpu to 1 in run_benchmark.sh.
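For reference, the change was effectively just this one assignment (paraphrased; the exact line in the repository's run_benchmark.sh may look slightly different):

# run_benchmark.sh: run on a single GPU only
num_gpu=1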
I checked the GPU node's kernel logs using the dmesg command:
$ dmesg
[249520.085383] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.085413] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
va_start=0x408c9e00000/va_end=0x408c9ffffff/rounded_size=0x200000/gpu_buf_length=0x200000
[249520.086320] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.086350] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
va_start=0x409ca000000/va_end=0x409ca1fffff/rounded_size=0x200000/gpu_buf_length=0x200000
[249520.087328] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.087364] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
va_start=0x40a4a200000/va_end=0x40a4a3fffff/rounded_size=0x200000/gpu_buf_length=0x200000
[249520.088303] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.088337] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
va_start=0x40a8a400000/va_end=0x40a8a5fffff/rounded_size=0x200000/gpu_buf_length=0x200000
[249520.089408] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.089443] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
va_start=0x40aaa600000/va_end=0x40aaa7fffff/rounded_size=0x200000/gpu_buf_length=0x200000
[249520.090361] NVRM: RmThirdPartyP2PBAR1GetPages: no space for BAR1 mappings, length: 0x200000
[249520.090396] nvidia-fs:nvfs_pin_gpu_pages:1321 Error ret -12 invoking nvidia_p2p_get_pages_persistent
va_start=0x40aab800000/va_end=0x40aab9fffff/rounded_size=0x200000/gpu_buf_length=0x200000
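These messages suggest the card is running out of BAR1 space for the 2 MiB pinned mappings; the gdscheck output further down reports only a 256 MiB BAR1 for this GPU. If it helps, I can also capture the BAR1 usage at the time of the failure, e.g. with the query below (as far as I know it includes a "BAR1 Memory Usage" section alongside the framebuffer usage):

$ nvidia-smi -q -d MEMORY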
In the container’s cufile.log, I also noticed the following errors:
28-10-2024 06:13:59:949 [pid=1064 tid=1118] ERROR cufio-obj:101 error allocating nvfs handle, size: 2097152
28-10-2024 06:13:59:949 [pid=1064 tid=1118] ERROR cufio:1170 cuFileBufRegister error, object allocation failed
28-10-2024 06:13:59:949 [pid=1064 tid=1118] ERROR cufio:1218 cuFileBufRegister error Failed to allocate pinned GPU Memory
28-10-2024 06:13:59:950 [pid=1064 tid=1118] ERROR 0:491 nvidia-fs MAP ioctl failed : -1
28-10-2024 06:13:59:950 [pid=1064 tid=1118] ERROR 0:506 map failed
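The cuFile memory limits that seem relevant here (max_device_pinned_mem_size_kb and max_device_cache_size_kb) can also be read straight from the container's config, assuming the default /etc/cufile.json path; the same values appear under CUFILE CONFIGURATION in the gdscheck output below:

$ grep -E 'max_device_(cache|pinned_mem)_size_kb' /etc/cufile.json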
Does anyone have any suggestions, or is there more information I can provide?
Thank you all!
Here is my server information:
OS: Ubuntu 20.04
$ nvidia-smi
Mon Oct 28 14:02:03 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Quadro RTX 5000 Off | 00000001:9B:00.0 Off | Off |
| 33% 27C P8 2W / 230W | 6MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
I also ran gdscheck.py to verify the environment status:
$ python3 /usr/local/cuda-12.6/gds/tools/gdscheck.py -p
GDS release version: 1.11.1.6
nvidia_fs version: 2.22 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Supported
NVMeOF : Supported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Enabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 4
execution.max_io_queue_depth : 128
execution.parallel_io : true
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 4
properties.force_odirect_mode : false
properties.prefer_iouring : false
=========
GPU INFO:
=========
GPU index 0 Quadro RTX 5000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
Cuda Driver Version Installed: 12060
Platform: KS 5000U, Arch: x86_64(Linux 5.15.0-124-generic)
Platform verification succeeded
Running gdsio works without any issues, and there are no errors in cufile.log:
$ ./gdsio -f {NVMe_Path}/testfile -d 0 -w 16 -s 20G -i 1M -x 0 -I 1
IoType: WRITE XferType: GPUD Threads: 16 DataSetSize: 19604480/20971520(KiB) IOSize: 1024(KiB) Throughput: 8.830188 GiB/sec, Avg_Latency: 1772.514806 usecs ops: 19145 total_time 2.117315 secs