NVMe-oF test, "attr_offload" test failed

Hi,
I am using a BlueField-3 card as the NVMe-oF target, with the following topology:
[X86/Initiator] ----- NVMe-oF (RDMA) ----- [BlueField/Target] ----- NVMe SSD

When offload is disabled (attr_offload=0), the test passes.
When offload is enabled (attr_offload=1), I get an "XRQ NVMF backend ctrl timeout error" in the logs, and the offload ctx is removed.

Every time, about 20 seconds after executing the connect command on the initiator side, the "XRQ NVMF backend ctrl timeout error (22)" inevitably appears; 22 corresponds to IB_EVENT_XRQ_NVMF_BACKEND_CTRL_TO_ERR.

Can anyone explain why the XRQ NVMF backend ctrl has a timeout error and what could go wrong?
Thanks in advance for any assistance or suggestions.

I followed this doc:
https://enterprise-support.nvidia.com/s/article/howto-configure-nvme-over-fabrics--nvme-of--target-offload
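For context, the initiator-side commands I run look roughly like this (the address, port, and subsystem name match my target configuration shown later in this post; adjust to your setup):

```shell
# Load the RDMA transport on the initiator
modprobe nvme-rdma

# Discover the target (10.192.2.20:4420 is the target port configured on the BlueField)
nvme discover -t rdma -a 10.192.2.20 -s 4420

# Connect to the offloaded subsystem
nvme connect -t rdma -n testsubsystem -a 10.192.2.20 -s 4420
```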

07:15:15 kernel: [ 919.654806] nvmet: creating nvm controller 1 for subsystem testsubsystem for NQN nqn.2014-08.org.nvmexpress:uuid:ef0cec00-a846-11ea-8000-ac1f6b3ea450.

07:15:15 kernel: [ 919.692117] nvmet_rdma: connect request (4): status 0 id 0000000020477746
07:15:15 kernel: [ 919.705730] nvmet_rdma: added mlx5_0.
07:15:15 kernel: [ 919.717137] nvmet_rdma: nvmet_rdma_create_queue_ib: max_cqe= 8191 max_sge= 30 sq_size = 102 cm_id= 0000000020477746
07:15:15 kernel: [ 919.738992] nvmet_rdma: established (9): status 0 id 0000000020477746
… …
… …
07:15:18 kernel: [ 922.503156] nvmet_rdma: connect request (4): status 0 id 00000000675c1077
07:15:18 kernel: [ 922.516756] nvmet_rdma: added mlx5_0.
07:15:18 kernel: [ 922.528174] nvmet_rdma: nvmet_rdma_create_queue_ib: max_cqe= 8191 max_sge= 30 sq_size = 102 cm_id= 00000000675c1077
07:15:18 kernel: [ 922.550005] nvmet_rdma: established (9): status 0 id 00000000675c1077

07:15:18 kernel: [ 922.562943] nvmet_rdma: using dynamic staging buffer 0000000053e1f05e
07:15:18 kernel: [ 922.622009] nvmet: Adding offload ctx 0 to configfs
07:15:18 kernel: [ 922.634362] nvmet: adding queue 1 to ctrl 1.
07:15:18 kernel: [ 922.674259] nvmet: adding queue 2 to ctrl 1.
… …
… …
07:15:20 kernel: [ 924.922805] nvmet: adding queue 47 to ctrl 1.
07:15:20 kernel: [ 924.972539] nvmet: adding queue 48 to ctrl 1.
… …
07:15:23 kernel: [ 927.351996] nvmet: ctrl 1 update keep-alive timer for 5 secs
07:15:25 kernel: [ 929.909981] nvmet: ctrl 1 update keep-alive timer for 5 secs
07:15:28 kernel: [ 932.467996] nvmet: ctrl 1 update keep-alive timer for 5 secs
07:15:30 kernel: [ 935.026018] nvmet: ctrl 1 update keep-alive timer for 5 secs
07:15:33 kernel: [ 937.584047] nvmet: ctrl 1 update keep-alive timer for 5 secs
07:15:35 kernel: [ 940.142050] nvmet: ctrl 1 update keep-alive timer for 5 secs
… …
07:15:37 kernel: [ 941.661828] nvme 0000:11:00.0: received IB Backend ctrl event: XRQ NVMF backend ctrl timeout error (22) be_ctrl 00000000f9eb18d8 id 0
07:15:37 kernel: [ 941.685916] nvmet: Removing offload ctx 0 from configfs

Version:
BlueField-3 SmartNIC Main Card (900-9D3C6-00SV-DA0)
Linux 5.15.0-1032-bluefield
MLNX_OFED_LINUX-23.10-1.2.0

The call path that raises IB_EVENT_XRQ_NVMF_BACKEND_CTRL_TO_ERR:

mlx5_srq_table->nb.notifier_call
srq_event_notifier
mlx5_ib_nvmf_backend_ctrl_event
nvmet_rdma_backend_ctrl_event
event(22) : IB_EVENT_XRQ_NVMF_BACKEND_CTRL_TO_ERR

Here are the configuration steps I used, following the doc above:


# check that the nvme driver reserved peer-to-peer queues for offload (should be non-zero)
cat /sys/module/nvme/parameters/num_p2p_queues

modprobe nvmet
modprobe nvmet-rdma
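Per the offload howto, the nvme driver also needs peer-to-peer queues reserved at module load time; if num_p2p_queues reads 0, reloading the driver with the parameter set is needed (the value 2 here is just an example):

```shell
# Reload nvme with peer-to-peer queues reserved for the offload path
# (num_p2p_queues=2 is an example value; choose one appropriate for your setup)
modprobe -r nvme
modprobe nvme num_p2p_queues=2
```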

mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem
echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/attr_allow_any_host
echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/attr_offload

mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1
echo -n /dev/nvme1n1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/enable

mkdir /sys/kernel/config/nvmet/ports/1
ls /sys/kernel/config/nvmet/ports/1
echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
echo 10.192.2.20 > /sys/kernel/config/nvmet/ports/1/addr_traddr
echo rdma > /sys/kernel/config/nvmet/ports/1/addr_trtype
echo ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam

ln -s /sys/kernel/config/nvmet/subsystems/testsubsystem/ /sys/kernel/config/nvmet/ports/1/subsystems/testsubsystem
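After linking the subsystem to the port, I sanity-check the offload state like this (these checks are my own addition, not from the doc):

```shell
# Confirm offload is enabled on the subsystem
cat /sys/kernel/config/nvmet/subsystems/testsubsystem/attr_offload

# Watch for the offload ctx / XRQ messages while the initiator connects
dmesg -w | grep -Ei 'nvmet|xrq|offload'
```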

The firmware trace also shows the following:

mlx5_fw: 0000:03:00.0 [0x2d20d6e56ec4] 0 [0x5] handle_nvmf_respond_exception sending_uapp_sw_cqe: gvmi: 0x0,qpn: 0x129
mlx5_fw: 0000:03:00.0 [0x2d20d6e5898c] 0 [0x5] handle_ace_req_cqe_error release: gvmi 0x0, tgt_num 0x10, qpn 0x129, action: 0x80000004

Hi,

Thanks for sharing the details.
Please make sure the DPU firmware is up to date:
https://network.nvidia.com/support/firmware/bluefield3/

You can run the below command inside the DPU to upgrade the firmware (it will choose the correct version automatically):
/opt/mellanox/mlnx-fw-updater/mlnx_fw_updater.pl
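To check which firmware is currently running before and after the update, querying with mlxfwmanager (shipped with MLNX_OFED/MFT) should work:

```shell
# Query the installed and available firmware versions for the BlueField device
mlxfwmanager --query
```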

After upgrading the firmware, please perform a full power cycle of the server (cold boot) and re-try the test.

If the issue persists with the recommended firmware, please open a support case in the NVIDIA portal, and it will be handled according to your support entitlement.

Best Regards,
Anatoly

Hello, I hit the same error ("received IB Backend ctrl event: XRQ NVMF backend ctrl timeout error (22)") when using NVMe-oF target offload on a BlueField-2.
I am also using MLNX_OFED_LINUX-23.10-2.1.3.1-rhel7.9-x86_64.
Did you find a solution?