SNAP-4 xlio packet limitation

I saw this error in the SNAP-4 container on BF3 during an NVMe/TCP test. It might be related to fio block sizes larger than 1M. Is there a configuration option or environment variable to increase the number of XLIO packets and prevent the connection failure?

[2025-07-04 02:21:59.475959] nvme_nvda_tcp.c: 751:xlio_sock_get_packet: *WARNING*: Not enough xlio packets, using dynamic allocation. Performance may be degraded
[2025-07-04 02:22:12.436704] bdev_nvme.c:5257:timeout_cb: *WARNING*: [nqn.2025-03.io.spdk:cnode1, 1] Warning: Detected a timeout. ctrlr=0xaaab0100a010 qpair=0x200004a09bc0 cid=12
[2025-07-04 02:22:12.437328] bdev_nvme.c:5257:timeout_cb: *WARNING*: [nqn.2025-03.io.spdk:cnode1, 1] Warning: Detected a timeout. ctrlr=0xaaab0100a010 qpair=0x200004a09bc0 cid=13

Here are the configurations I used to run SNAP-4 with XLIO:

  1. mlxconfig
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e q | grep -i \*
*       NVME_EMULATION_ENABLE                       False(0)             True(1)              True(1)
*       NVME_EMULATION_NUM_PF                       1                    2                    2
*       NVME_EMULATION_NUM_MSIX                     0                    64                   0
*       VIRTIO_BLK_EMULATION_NUM_MSIX               2                    0                    2
*       VIRTIO_FS_EMULATION_NUM_MSIX                2                    0                    2
*       VIRTIO_NET_EMULATION_NUM_MSIX               2                    0                    2
*       PER_PF_NUM_SF                               False(0)             True(1)              True(1)
*       NVME_EMU_MNG_ENABLE                         False(0)             True(1)              False(0)
*       NVME_EMU_MNG_NUM_PF                         1                    2                    1
*       PF_TOTAL_SF                                 0                    32                   2
*       PF_SF_BAR_SIZE                              0                    8                    8
The '*' shows parameters with next value different from default/current value.
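
For reference, the "next" values above can be programmed with mlxconfig set followed by a firmware reset or power cycle; a minimal sketch, using the same device path and a subset of the parameters from the query above:

# program next-boot values (subset of the ones listed above)
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 set \
    NVME_EMULATION_ENABLE=1 NVME_EMULATION_NUM_PF=2 \
    PER_PF_NUM_SF=1 PF_TOTAL_SF=2 PF_SF_BAR_SIZE=8
# apply them (a full host power cycle also works)
sudo mlxfwreset -d /dev/mst/mt41692_pciconf0 reset -y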
  2. xlio.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: snap
spec:
  hostNetwork: true
  containers:
  - name: snap
    image: nvcr.io/nvidia/doca/doca_snap:4.7.0-doca3.0.0
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
      capabilities:
        add: ["IPC_LOCK", "SYS_RAWIO", "SYS_NICE"]
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepages
    - mountPath: /dev/shm
      name: shm
    - mountPath: /dev/infiniband
      name: infiniband
    - mountPath: /dev/vfio
      name: vfio
    - mountPath: /etc/nvda_snap
      name: conf
    - mountPath: /var/log/snap-log
      name: snap-log
    resources:
      requests:
        memory: "4Gi"
        cpu: "8"
      limits:
        hugepages-2Mi: "12Gi"
        memory: "16Gi"
        cpu: "16"
    env:
      ## To enable XLIO un-comment SPDK_XLIO_PATH
      ## App-Specific command line arguments
      - name: APP_ARGS
        value: "--wait-for-rpc"
      - name: SPDK_XLIO_PATH
        value: "/usr/lib/libxlio.so"
      #- name: SPDK_RPC_INIT_CONF_JSON
      #  value: "/etc/nvda_snap/config.json"
      - name: SPDK_RPC_INIT_CONF
        value: "/etc/nvda_snap/spdk_rpc_init.conf"
      - name: SNAP_RPC_INIT_CONF
        value: "/etc/nvda_snap/snap_rpc_init.conf"
      - name: XLIO_RX_BUFS
        value: "8192"
      - name: XLIO_TX_BUFS
        value: "8192"
      - name: SNAP_MEMPOOL_SIZE_MB
        value: "8192"
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages
  - name: shm
    hostPath:
      path: /dev/shm
  - name: infiniband
    hostPath:
      path: /dev/infiniband
  - name: vfio
    hostPath:
      path: /dev/vfio
  - name: conf
    hostPath:
      path: /etc/nvda_snap
  - name: snap-log
    hostPath:
      path: /var/log/snap-log
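
As a sketch (assuming the standard standalone-kubelet flow used for DOCA containers on the DPU), the pod is deployed and checked like this; the container ID placeholder is hypothetical:

sudo cp xlio.yaml /etc/kubelet.d/       # standalone kubelet picks up the pod spec automatically
sudo crictl ps                          # confirm the snap container is running
sudo crictl logs <snap-container-id>    # XLIO banner and the warning above show up here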
  3. SNAP-4 RPC
spdk_rpc.py sock_set_default_impl -i xlio
spdk_rpc.py framework_start_init
spdk_rpc.py bdev_nvme_set_options --transport-ack-timeout 12

spdk_rpc.py bdev_nvme_attach_controller -b Nvme0 -t nvda_tcp -a 105.22.0.101 -f ipv4 -s 4420 -n nqn.2025-03.io.spdk:cnode0
spdk_rpc.py bdev_nvme_attach_controller -b Nvme1 -t nvda_tcp -a 105.22.1.101 -f ipv4 -s 4420 -n nqn.2025-03.io.spdk:cnode1

snap_rpc.py nvme_subsystem_create --nqn nqn.2022-10.io.nvda.nvme:0
snap_rpc.py nvme_subsystem_create --nqn nqn.2022-10.io.nvda.nvme:1

snap_rpc.py nvme_namespace_create -b Nvme0n1 -n 1 --nqn nqn.2022-10.io.nvda.nvme:0 --uuid 3d9c3b54-5c31-410a-b4f0-7cf2afd9e111
snap_rpc.py nvme_namespace_create -b Nvme1n1 -n 2 --nqn nqn.2022-10.io.nvda.nvme:1 --uuid 3d9c3b54-5c31-410a-b4f0-7cf2afd9e112

snap_rpc.py nvme_controller_create --nqn nqn.2022-10.io.nvda.nvme:0 --ctrl NVMeCtrl1 --pf_id 0 --suspended -n 31
snap_rpc.py nvme_controller_create --nqn nqn.2022-10.io.nvda.nvme:1 --ctrl NVMeCtrl2 --pf_id 1 --suspended -n 31

snap_rpc.py nvme_controller_attach_ns -c NVMeCtrl1 -n 1
snap_rpc.py nvme_controller_attach_ns -c NVMeCtrl2 -n 2

snap_rpc.py nvme_controller_resume -c NVMeCtrl1
snap_rpc.py nvme_controller_resume -c NVMeCtrl2
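
As an optional sanity check after the resume step (a sketch; these are the standard SPDK/SNAP-4 RPC names and may differ slightly by version):

spdk_rpc.py bdev_nvme_get_controllers   # remote NVMe/TCP controllers Nvme0 / Nvme1
spdk_rpc.py bdev_get_bdevs              # Nvme0n1 / Nvme1n1 should be listed
snap_rpc.py nvme_controller_list        # emulated controllers NVMeCtrl1 / NVMeCtrl2
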
 XLIO INFO   : ---------------------------------------------------------------------------
 XLIO INFO   : XLIO_VERSION: 3.50.3-1 Release built on Mar 31 2025 07:01:18
 XLIO INFO   : Git: 0d8f272ca2ac1440db92a477075578d9ec5bf8cb
 XLIO INFO   : Cmd Line: /opt/nvidia/nvda_snap/bin/snap_service --wait-for-rpc -r /var/tmp/spdk.sock
 XLIO INFO   : OFED Version: OFED-internal-25.04-0.6.1:
 XLIO INFO   : ---------------------------------------------------------------------------
 XLIO INFO   : Spec                           NVMEoTCP Profile for BF3   [XLIO_SPEC]
 XLIO INFO   : Log Level                      INFO                       [XLIO_TRACELEVEL]
 XLIO INFO   : Ring On Device Memory TX       1024                       [XLIO_RING_DEV_MEM_TX]
 XLIO INFO   : Tx QP WRE                      1024                       [XLIO_TX_WRE]
 XLIO INFO   : Tx QP WRE Batching             128                        [XLIO_TX_WRE_BATCHING]
 XLIO INFO   : Tx Bufs Batch TCP              1                          [XLIO_TX_BUFS_BATCH_TCP]
 XLIO INFO   : Rx QP WRE                      32                         [XLIO_RX_WRE]
 XLIO INFO   : Rx Prefetch Bytes Before Poll  256                        [XLIO_RX_PREFETCH_BYTES_BEFORE_POLL]
 XLIO INFO   : GRO max streams                0                          [XLIO_GRO_STREAMS_MAX]
 XLIO INFO   : STRQ Strides per RWQE          8192                       [XLIO_STRQ_NUM_STRIDES]
 XLIO INFO   : CQ Drain Thread                Disabled                   [XLIO_PROGRESS_ENGINE_INTERVAL]
 XLIO INFO   : CQ Adaptive Moderation         Disabled                   [XLIO_CQ_AIM_INTERVAL_MSEC]
 XLIO INFO   : CQ Keeps QP Full               Disabled                   [XLIO_CQ_KEEP_QP_FULL]
 XLIO INFO   : QP Compensation Level          8                          [XLIO_QP_COMPENSATION_LEVEL]
 XLIO INFO   : TCP nodelay                    1                          [XLIO_TCP_NODELAY]
 XLIO INFO   : Avoid sys-calls on tcp fd      Enabled                    [XLIO_AVOID_SYS_CALLS_ON_TCP_FD]
 XLIO INFO   : Internal Thread Affinity       0x01                       [XLIO_INTERNAL_THREAD_AFFINITY]
 XLIO INFO   : Memory limit                   256 MB                     [XLIO_MEMORY_LIMIT]
 XLIO INFO   : Memory limit (user allocator)  2 GB                       [XLIO_MEMORY_LIMIT_USER]
 XLIO INFO   : SocketXtreme mode              Enabled                    [XLIO_SOCKETXTREME]
 XLIO INFO   : TSO support                    Enabled                    [XLIO_TSO]
 XLIO INFO   : LRO support                    Enabled                    [XLIO_LRO]
 XLIO INFO   : fork() support                 Disabled                   [XLIO_FORK]
 XLIO INFO   : TCP abort on close             Enabled                    [XLIO_TCP_ABORT_ON_CLOSE]
 XLIO INFO   : ---------------------------------------------------------------------------
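
The banner prints only a subset of the parameters, so XLIO_RX_BUFS/XLIO_TX_BUFS are not visible here. Assuming the DETAILS trace level documented for libxlio, adding the following to the container env in the pod spec above makes the full effective parameter list (including the buffer counts) appear in the log:

      - name: XLIO_TRACELEVEL
        value: "DETAILS"   # prints every effective XLIO parameter at startup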
  4. nvmf_tgt
sudo scripts/rpc.py -s /var/tmp/spdk-mango-00.sock iobuf_set_options \
  --small-pool-count 32767 --large-pool-count 16383
sudo scripts/rpc.py -s /var/tmp/spdk-mango-00.sock framework_start_init
sudo scripts/rpc.py -s /var/tmp/spdk-mango-00.sock \
  nvmf_create_transport \
  --trtype TCP \
  --max-queue-depth 128 \
  --max-io-qpairs-per-ctrlr 127 \
  --in-capsule-data-size 8192 \
  --io-unit-size 8192 \
  --max-aq-depth 128 \
  --num-shared-buffers 8192 \
  --buf-cache-size 32 \
  --sock-priority 0 \
  --abort-timeout-sec 1
for i in {0..1};
do
  sudo scripts/rpc.py -s /var/tmp/spdk-mango-00.sock \
    bdev_null_create "Nullb"$i 65536 1024
done
for i in {0..1};
do
  sudo scripts/rpc.py -s /var/tmp/spdk-mango-00.sock \
    nvmf_create_subsystem "nqn.2025-03.io.spdk:cnode"$i -a \
    -s "SPDK0000000000000"$i -d "SPDK_Controller"$i
done
for i in {0..1};
do
  sudo scripts/rpc.py -s /var/tmp/spdk-mango-00.sock \
    nvmf_subsystem_add_ns "nqn.2025-03.io.spdk:cnode"$i "Nullb"$i
done
for i in {0..1};
do
  sudo scripts/rpc.py -s /var/tmp/spdk-mango-00.sock \
    nvmf_subsystem_add_listener "nqn.2025-03.io.spdk:cnode"$i -t tcp -a 105.22.$i.101 -s 4420
done
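
Optionally, the target side can be verified with the standard SPDK RPCs (a sketch using the same socket path):

sudo scripts/rpc.py -s /var/tmp/spdk-mango-00.sock nvmf_get_subsystems   # listeners and namespaces
sudo scripts/rpc.py -s /var/tmp/spdk-mango-00.sock nvmf_get_transports   # transport parameters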
  5. fio
$ cat fio_config.fio
[global]
ioengine=libaio
direct=1

group_reporting=1
random_generator=tausworthe64
time_based=1
runtime=100
direct=1
rw=randread
bs=1M
numjobs=32
iodepth=128
rwmixread=50
cpus_allowed_policy=split

[job0]
filename=/dev/nvme0n1
cpus_allowed=0-15,48-63

[job1]
filename=/dev/nvme1n1
cpus_allowed=16-47

$ lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda       8:0    0 447.1G  0 disk
├─sda1    8:1    0     1G  0 part
└─sda2    8:2    0 446.1G  0 part
sdb       8:16   0 931.5G  0 disk
├─sdb1    8:17   0     1G  0 part /boot/efi
└─sdb2    8:18   0 930.5G  0 part /
nvme0n1 259:0    0    64G  0 disk
nvme1n1 259:1    0    64G  0 disk
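
To narrow down whether the warning really correlates with block size, a quick sweep like the following can help (a sketch; the job parameters are illustrative, not the ones from fio_config.fio):

# run a short randread probe at several block sizes against the first emulated namespace
for bs in 256k 512k 1M 2M; do
  sudo fio --name=probe --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
           --rw=randread --bs=$bs --iodepth=32 --numjobs=4 \
           --time_based --runtime=30 --group_reporting
done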

Hi polaris6921,

In your configuration, XLIO_RX_BUFS and XLIO_TX_BUFS are both set to 8192, which is already above the default. For very large block sizes and high concurrency, however, you may need to increase them further (e.g., to 16384 or more), depending on available memory.
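
For example, in the pod spec you posted this would look like the snippet below (values are illustrative; the hugepages and memory limits may need to grow accordingly):

      - name: XLIO_RX_BUFS
        value: "16384"
      - name: XLIO_TX_BUFS
        value: "16384"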

Regards,

Quanying