NVMe-oF test, "attr_offload" test failed

Hi,
I am using a BlueField-3 card as the NVMe-oF target, with the following topology:
[X86/Initiator] ----- NVMe-oF (RDMA) ----- [BlueField/Target] ----- NVMe SSD

When offload is disabled (attr_offload=0), the test passes.
When offload is enabled (attr_offload=1), I get an "XRQ NVMF backend ctrl timeout error" in the logs, and the offload ctx is removed.

Every time, about 20 seconds after executing the connect command on the initiator side, the "XRQ NVMF backend ctrl timeout error (22)" inevitably appears; 22 corresponds to IB_EVENT_XRQ_NVMF_BACKEND_CTRL_TO_ERR.

Can anyone explain why the XRQ NVMF backend ctrl has a timeout error and what could go wrong?
Thanks in advance for any assistance or suggestions.

I followed this doc:
https://enterprise-support.nvidia.com/s/article/howto-configure-nvme-over-fabrics--nvme-of--target-offload
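For context, the initiator-side commands I run look roughly like this (the address, port, and subsystem name match my target configuration shown later in this post; adjust to your setup):

```shell
# Load the RDMA transport on the initiator
modprobe nvme-rdma

# Discover the target (10.192.2.20:4420 is the target port configured on the BlueField)
nvme discover -t rdma -a 10.192.2.20 -s 4420

# Connect to the offloaded subsystem
nvme connect -t rdma -n testsubsystem -a 10.192.2.20 -s 4420
```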

07:15:15 kernel: [ 919.654806] nvmet: creating nvm controller 1 for subsystem testsubsystem for NQN nqn.2014-08.org.nvmexpress:uuid:ef0cec00-a846-11ea-8000-ac1f6b3ea450.

07:15:15 kernel: [ 919.692117] nvmet_rdma: connect request (4): status 0 id 0000000020477746
07:15:15 kernel: [ 919.705730] nvmet_rdma: added mlx5_0.
07:15:15 kernel: [ 919.717137] nvmet_rdma: nvmet_rdma_create_queue_ib: max_cqe= 8191 max_sge= 30 sq_size = 102 cm_id= 0000000020477746
07:15:15 kernel: [ 919.738992] nvmet_rdma: established (9): status 0 id 0000000020477746
… …
… …
07:15:18 kernel: [ 922.503156] nvmet_rdma: connect request (4): status 0 id 00000000675c1077
07:15:18 kernel: [ 922.516756] nvmet_rdma: added mlx5_0.
07:15:18 kernel: [ 922.528174] nvmet_rdma: nvmet_rdma_create_queue_ib: max_cqe= 8191 max_sge= 30 sq_size = 102 cm_id= 00000000675c1077
07:15:18 kernel: [ 922.550005] nvmet_rdma: established (9): status 0 id 00000000675c1077

07:15:18 kernel: [ 922.562943] nvmet_rdma: using dynamic staging buffer 0000000053e1f05e
07:15:18 kernel: [ 922.622009] nvmet: Adding offload ctx 0 to configfs
07:15:18 kernel: [ 922.634362] nvmet: adding queue 1 to ctrl 1.
07:15:18 kernel: [ 922.674259] nvmet: adding queue 2 to ctrl 1.
… …
… …
07:15:20 kernel: [ 924.922805] nvmet: adding queue 47 to ctrl 1.
07:15:20 kernel: [ 924.972539] nvmet: adding queue 48 to ctrl 1.
… …
07:15:23 kernel: [ 927.351996] nvmet: ctrl 1 update keep-alive timer for 5 secs
07:15:25 kernel: [ 929.909981] nvmet: ctrl 1 update keep-alive timer for 5 secs
07:15:28 kernel: [ 932.467996] nvmet: ctrl 1 update keep-alive timer for 5 secs
07:15:30 kernel: [ 935.026018] nvmet: ctrl 1 update keep-alive timer for 5 secs
07:15:33 kernel: [ 937.584047] nvmet: ctrl 1 update keep-alive timer for 5 secs
07:15:35 kernel: [ 940.142050] nvmet: ctrl 1 update keep-alive timer for 5 secs
… …
07:15:37 kernel: [ 941.661828] nvme 0000:11:00.0: received IB Backend ctrl event: XRQ NVMF backend ctrl timeout error (22) be_ctrl 00000000f9eb18d8 id 0
07:15:37 kernel: [ 941.685916] nvmet: Removing offload ctx 0 from configfs

Version:
BlueField-3 SmartNIC Main Card (900-9D3C6-00SV-DA0)
Linux 5.15.0-1032-bluefield
MLNX_OFED_LINUX-23.10-1.2.0

The call path that raises IB_EVENT_XRQ_NVMF_BACKEND_CTRL_TO_ERR:

mlx5_srq_table->nb.notifier_call
srq_event_notifier
mlx5_ib_nvmf_backend_ctrl_event
nvmet_rdma_backend_ctrl_event
event(22) : IB_EVENT_XRQ_NVMF_BACKEND_CTRL_TO_ERR

Here are the configuration steps I used, following the doc above:


# check that the nvme driver reserved peer-to-peer queues for offload (should be non-zero)
cat /sys/module/nvme/parameters/num_p2p_queues

modprobe nvmet
modprobe nvmet-rdma
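Per the offload howto, the nvme driver also needs peer-to-peer queues reserved at module load time; if num_p2p_queues reads 0, reloading the driver with the parameter set is needed (the value 2 here is just an example):

```shell
# Reload nvme with peer-to-peer queues reserved for the offload path
# (num_p2p_queues=2 is an example value; choose one appropriate for your setup)
modprobe -r nvme
modprobe nvme num_p2p_queues=2
```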

mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem
echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/attr_allow_any_host
echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/attr_offload

mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1
echo -n /dev/nvme1n1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsystem/namespaces/1/enable

mkdir /sys/kernel/config/nvmet/ports/1
ls /sys/kernel/config/nvmet/ports/1
echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
echo 10.192.2.20 > /sys/kernel/config/nvmet/ports/1/addr_traddr
echo rdma > /sys/kernel/config/nvmet/ports/1/addr_trtype
echo ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam

ln -s /sys/kernel/config/nvmet/subsystems/testsubsystem/ /sys/kernel/config/nvmet/ports/1/subsystems/testsubsystem
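After linking the subsystem to the port, I sanity-check the offload state like this (these checks are my own addition, not from the doc):

```shell
# Confirm offload is enabled on the subsystem
cat /sys/kernel/config/nvmet/subsystems/testsubsystem/attr_offload

# Watch for the offload ctx / XRQ messages while the initiator connects
dmesg -w | grep -Ei 'nvmet|xrq|offload'
```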

The firmware trace also shows the following:

mlx5_fw: 0000:03:00.0 [0x2d20d6e56ec4] 0 [0x5] handle_nvmf_respond_exception sending_uapp_sw_cqe: gvmi: 0x0,qpn: 0x129
mlx5_fw: 0000:03:00.0 [0x2d20d6e5898c] 0 [0x5] handle_ace_req_cqe_error release: gvmi 0x0, tgt_num 0x10, qpn 0x129, action: 0x80000004

Hi,

Thanks for sharing the details.
Please make sure the DPU firmware is up to date:
https://network.nvidia.com/support/firmware/bluefield3/

You can run the below command inside the DPU to upgrade the firmware (it will choose the correct version automatically):
/opt/mellanox/mlnx-fw-updater/mlnx_fw_updater.pl
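To check which firmware is currently running before and after the update, querying with mlxfwmanager (shipped with MLNX_OFED/MFT) should work:

```shell
# Query the installed and available firmware versions for the BlueField device
mlxfwmanager --query
```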

After upgrading the firmware, please perform a full power cycle of the server (cold boot) and re-try the test.

If the issue persists with the recommended firmware, please open a support case in the NVIDIA portal, and it will be handled according to your support entitlement.

Best Regards,
Anatoly

Hello, I hit the same error ("received IB Backend ctrl event: XRQ NVMF backend ctrl timeout error (22)") when using NVMe-oF target offload on a BlueField-2.
I am also using MLNX_OFED_LINUX-23.10-2.1.3.1-rhel7.9-x86_64.
Did you find a solution?