GPUDirect Storage cuFileWrite() error with RTX A4000

I want to use GPUDirect Storage (GDS) with an NVIDIA RTX A4000 on Rocky Linux 8.6.

However, I get an error when running the cuFile sample program cufile_sample_001, which is available here:
MagnumIO/gds/samples/cufile_sample_001.cc (main branch of the NVIDIA/MagnumIO repository on GitHub)

Could you please provide guidance on how to resolve this issue?

Error Message:

cufile.log
 04-10-2024 11:58:35:193 [pid=1958 tid=1958] ERROR  0:1534 IOCTL failed io-type 1 ret -5 expected 131072 gpu_page_offset 0
dmesg
[ 1191.211886] nvme nvme1: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x2010
[ 1191.226918] nvme nvme1: Shutdown timeout set to 8 seconds
[ 1191.279382] nvme nvme1: 12/0/0 default/read/poll queues
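
For reference, the two register values in the dmesg line can be decoded bit by bit. The sketch below is a minimal decoder, assuming the standard NVMe CSTS and PCI Status register layouts; the names `CSTS_FLAGS`, `PCI_STATUS_FLAGS`, `decode_flags`, and `csts_shutdown_status` are my own and not part of any tool. It only shows which flags are set (controller ready plus controller fatal status, and a received master abort on the PCI side); it does not identify the root cause.

```python
# Flag bits of the NVMe Controller Status (CSTS) register.
# Bits 3:2 (SHST) are a 2-bit field and are handled separately below.
CSTS_FLAGS = {
    0: "RDY (Ready)",
    1: "CFS (Controller Fatal Status)",
    4: "NSSRO (NVM Subsystem Reset Occurred)",
    5: "PP (Processing Paused)",
}

# Flag bits of the 16-bit PCI Status register.
PCI_STATUS_FLAGS = {
    3: "Interrupt Status",
    4: "Capabilities List",
    5: "66 MHz Capable",
    7: "Fast Back-to-Back Capable",
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}


def decode_flags(value, flags):
    """Return the names of all flag bits set in value, lowest bit first."""
    return [name for bit, name in sorted(flags.items()) if value & (1 << bit)]


def csts_shutdown_status(csts):
    """Return the 2-bit SHST field (bits 3:2) of CSTS."""
    return (csts >> 2) & 0x3


if __name__ == "__main__":
    # Values taken from the dmesg line above.
    print("CSTS=0x3       ->", decode_flags(0x3, CSTS_FLAGS))
    print("PCI_STATUS=0x2010 ->", decode_flags(0x2010, PCI_STATUS_FLAGS))
```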

The same problem occurs when I run the gdsio program in write mode (-I 1), but reads (-I 0) work without any issues.

read (-I 0)
#  ./gdsio -f /mnt/gdstest/test1.log -d 0 -w 1 -s 1G -i 32k -I 0 -x 0
IoType: READ XferType: GPUD Threads: 1 DataSetSize: 1048576/1048576(KiB) IOSize: 32(KiB) Throughput: 1.285050 GiB/sec, Avg_Latency: 23.500671 usecs ops: 32768 total_time 0.778180 secs

write (-I 1)
#   ./gdsio -f /mnt/gdstest/test1.log -d 0 -w 1 -s 1G -i 32k -I 1 -x 0
write io failed of type 1 size: 32768 , ret: 0
failed to submit io of type 1 ret: -5
Error: IO failed stopping traffic, fd :51 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :0, block size  :32768
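
The `ret -5` in cufile.log and the `ret: -5 errno :5` in the gdsio output both follow the usual Linux convention of returning a negated errno value. A minimal sketch for looking up such a code with the Python standard library (the helper name `describe_ret` is hypothetical):

```python
import errno
import os


def describe_ret(ret):
    """Map a kernel-style return value such as -5 to (errno name, message)."""
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # -5 is the value reported by both cufile.log and gdsio above.
    print(describe_ret(-5))
```

On Linux, errno 5 is EIO, a generic I/O error, which is consistent with the NVMe controller reset reported in dmesg at the time of the write.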

Platform:

OS
# cat /etc/redhat-release
Rocky Linux release 8.6 (Green Obsidian)

# uname -a
Linux testhost 4.18.0-372.32.1.el8_6.x86_64 #1 SMP Thu Oct 27 15:18:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

# ofed_info -s
MLNX_OFED_LINUX-24.04-0.6.6.0:

# rpm -qa | grep mlnx
mlnx-ofa_kernel-devel-24.04-OFED.24.04.0.6.6.1.rhel8u6.x86_64
libibumad-2404mlnx51-1.2404066.x86_64
knem-1.1.4.90mlnx3-OFED.23.10.0.2.1.1.rhel8u6.x86_64
kmod-mlnx-nvme-24.04-OFED.24.04.0.6.6.1.rhel8u6.x86_64
libibverbs-utils-2404mlnx51-1.2404066.x86_64
mlnx-ofa_kernel-24.04-OFED.24.04.0.6.6.1.rhel8u6.x86_64
kmod-kernel-mft-mlnx-4.28.0-1.rhel8u6.x86_64
kmod-mlnx-nfsrdma-24.04-OFED.24.04.0.6.6.1.rhel8u6.x86_64
rdma-core-2404mlnx51-1.2404066.x86_64
rdma-core-devel-2404mlnx51-1.2404066.x86_64
librdmacm-utils-2404mlnx51-1.2404066.x86_64
srp_daemon-2404mlnx51-1.2404066.x86_64
mlnx-iproute2-6.7.0-1.2404066.x86_64
mlnx-fw-updater-24.04-0.6.6.0.x86_64
libibverbs-2404mlnx51-1.2404066.x86_64
mlnx-ofa_kernel-source-24.04-OFED.24.04.0.6.6.1.rhel8u6.x86_64
infiniband-diags-2404mlnx51-1.2404066.x86_64
ibacm-2404mlnx51-1.2404066.x86_64
mlnx-ethtool-6.7-1.2404066.x86_64
mlnxofed-docs-24.04-0.6.6.0.noarch
mlnx-tools-24.04-0.2404066.x86_64
kmod-knem-1.1.4.90mlnx3-OFED.23.10.0.2.1.1.rhel8u6.x86_64
kmod-mlnx-ofa_kernel-24.04-OFED.24.04.0.6.6.1.rhel8u6.x86_64
librdmacm-2404mlnx51-1.2404066.x86_64

NVMe device
# nvme list
Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme1n1          S649NU0WB60430K      Samsung SSD 980 1TB                      1         667.40  GB /   1.00  TB    512   B +  0 B   3B4QFXO7

Filesystem
/dev/nvme1n1 on /mnt type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=32,swidth=256,noquota)

/usr/local/cuda/gds/tools/gdscheck -p
 GDS release version: 1.11.1.6
 nvidia_fs version:  2.22 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Supported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA RTX A4000 bar:1 bar size (MiB):32768 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 Found ACS enabled for switch 0000:00:1c.4
 IOMMU: disabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  12060
 Platform: DGI5G60BC65CNHB3, Arch: x86_64(Linux 4.18.0-372.32.1.el8_6.x86_64)
 Platform verification succeeded

# nvidia-smi
Fri Oct  4 12:54:35 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A4000               Off |   00000000:05:00.0 Off |                  Off |
| 41%   35C    P8              7W /  140W |       2MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

I updated the driver to v24.07-0.6.1.0 and also tried SSDs from other manufacturers (Phison Electronics and MAXIO Technology (Hangzhou)), but the result did not change.

From the gdscheck output, I noticed that PCI ACS is enabled:

==============
 PLATFORM INFO:
 ==============
 Found ACS enabled for switch 0000:00:1c.4
 IOMMU: disabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)

So I disabled it as shown below, but the result did not change.

# setpci  -v -s 00:1c.4 220+6.w=0000
# lspci -vv -s 00:1c.4|grep 'Access Control Services' -A2
        Capabilities: [220 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
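
To check the ACS state on more than one bridge without reading each `lspci` dump by eye, the `ACSCtl:` line can be parsed programmatically. This is a minimal sketch, assuming the `lspci -vv` text format shown above; the function names `parse_acs_ctl` and `acs_enabled_flags` are hypothetical. Note that a value written with `setpci` is generally not kept across a reboot, so the check is worth repeating after restarting the machine.

```python
import re


def parse_acs_ctl(lspci_text):
    """Return {flag_name: enabled} parsed from the 'ACSCtl:' line of lspci -vv output."""
    for line in lspci_text.splitlines():
        line = line.strip()
        if line.startswith("ACSCtl:"):
            # Each flag is printed as Name+ (enabled) or Name- (disabled).
            pairs = re.findall(r"(\w+)([+-])", line[len("ACSCtl:"):])
            return {name: sign == "+" for name, sign in pairs}
    return {}  # no ACS control line found in the given text


def acs_enabled_flags(lspci_text):
    """Return the names of ACS control flags that are still enabled."""
    return [name for name, on in parse_acs_ctl(lspci_text).items() if on]


if __name__ == "__main__":
    # Sample text copied from the lspci output above.
    sample = (
        "Capabilities: [220 v1] Access Control Services\n"
        "    ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-\n"
        "    ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-\n"
    )
    print(acs_enabled_flags(sample))
```

For the output above, no flag is reported as enabled, which matches the `setpci` write of 0000 to the ACS control register on 00:1c.4.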

I also tried turning off VT-d in the BIOS, but the results did not change.

Could you please let me know if there’s anything else I should try?