Running a VM with a vDPA interface inside a container causes random host crashes

I'm using a BlueField-2 card and creating vDPA devices on top of VFs and SFs.
I can start a VM with a vDPA interface directly on the host, and everything works fine.
But when I create a container, mount /dev/vhost-vdpa-x into the container, and start the VM with the vDPA interface from inside it, the host crashes randomly (sometimes the VM starts successfully).
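For reference, the flow looks roughly like the sketch below. The management device address, device node name, container image, and QEMU options are illustrative placeholders, not my exact command lines:

```shell
# Create a vDPA device on top of the VF/SF (PCI address is illustrative)
vdpa dev add name vdpa0 mgmtdev pci/0000:03:00.4

# Pass the resulting vhost-vdpa character device into the container
# ("my-qemu-image" is a placeholder image name)
docker run -it --rm \
    --device /dev/vhost-vdpa-0 \
    my-qemu-image

# Inside the container, start the guest with a vhost-vdpa backend
qemu-system-x86_64 -M q35 -enable-kvm -m 4G \
    -netdev vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=net0 \
    -device virtio-net-pci,netdev=net0
```

The crash happens only when QEMU runs inside the container; the same QEMU invocation directly on the host is stable.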

kernel version: 5.15.0-101-generic

I got the kernel logs below:

Mar 22 07:44:11 c kernel: [ 1751.655385] mlx5_core 0000:03:00.4: mlx5_vdpa_reset:2236:(pid 26153): performing device reset
Mar 22 07:44:14 c kernel: [ 1754.360813] mlx5_core 0000:03:00.4: mlx5_vdpa_handle_set_map:565:(pid 27096): memory map update
Mar 22 07:44:14 c kernel: [ 1754.546172] mlx5_core 0000:03:00.4: mlx5_vdpa_handle_set_map:565:(pid 27097): memory map update
Mar 22 07:44:14 c kernel: [ 1754.568751] mlx5_core 0000:03:00.4: mlx5_cmd_check:782:(pid 27097): QUERY_GENERAL_OBJECT(0xa02) op_mod(0xd) failed, status bad parameter(0x3), syndrome (0xe108ed)
Mar 22 07:44:14 c kernel: [ 1754.568757] mlx5_core 0000:03:00.4: suspend_vq:1208:(pid 27097) warning: failed to query virtqueue
Mar 22 07:44:14 c kernel: [ 1754.569245] mlx5_core 0000:03:00.4: mlx5_cmd_check:782:(pid 27097): QUERY_GENERAL_OBJECT(0xa02) op_mod(0xd) failed, status bad parameter(0x3), syndrome (0xe108ed)
Mar 22 07:44:14 c kernel: [ 1754.588185] mlx5_core 0000:03:00.4: mlx5_cmd_check:782:(pid 27097): DESTROY_GENERAL_OBJECT(0xa03) op_mod(0xd) failed, status bad resource state(0x9), syndrome (0xb60e9c)
Mar 22 07:44:14 c kernel: [ 1754.588190] mlx5_core 0000:03:00.4: destroy_virtqueue:908:(pid 27097) warning: destroy virtqueue 0x100b
Mar 22 07:44:14 c kernel: [ 1754.589325] mlx5_core 0000:03:00.4: mlx5_cmd_check:782:(pid 27097): DESTROY_QP(0x501) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x4a6fc9)
Mar 22 07:44:14 c kernel: [ 1754.589328] mlx5_core 0000:03:00.4: qp_destroy:503:(pid 27097) warning: destroy qp 0x4e4
Mar 22 07:44:14 c kernel: [ 1754.589733] mlx5_core 0000:03:00.4: mlx5_cmd_check:782:(pid 24707): DESTROY_QP(0x501) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x25b161)
Mar 22 07:44:14 c kernel: [ 1754.589738] mlx5_core 0000:03:00.4: qp_destroy:503:(pid 24707) warning: destroy qp 0x4e5
Mar 22 07:44:14 c kernel: [ 1754.589744] general protection fault, probably for non-canonical address 0x31cb2f7f65ab9c0: 0000 [#1] SMP NOPTI
Mar 22 07:44:14 c kernel: [ 1754.589746] CPU: 9 PID: 24707 Comm: kworker/u24:2 Not tainted 5.15.0-101-generic #111-Ubuntu
Mar 22 07:44:14 c kernel: [ 1754.589749] Hardware name: Micro-Star International Co., Ltd. MS-7D42/MAG B660M MORTAR WIFI DDR4 (MS-7D42), BIOS 1.90 11/10/2022
Mar 22 07:44:14 c kernel: [ 1754.589750] Workqueue: mlx5_vdpa_wq mlx5_cvq_kick_handler [mlx5_vdpa]
Mar 22 07:44:14 c kernel: [ 1754.589756] RIP: 0010:__free_pages+0x13/0xc0
Mar 22 07:44:14 c kernel: [ 1754.589760] Code: 31 f6 e8 d0 fd ff ff 5d c3 cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 49 89 fe 41 55 41 54 53 <48> 8b 07 f0 ff 4f 34 74 6b a9 00 00 01 00 75 57 44 8d 6e ff 85 f6
Mar 22 07:44:14 c kernel: [ 1754.589762] RSP: 0018:ffffb3ed036e7cc8 EFLAGS: 00010207
Mar 22 07:44:14 c kernel: [ 1754.589763] RAX: 0000000000000000 RBX: 0000000000001fff RCX: 0000000000000000
Mar 22 07:44:14 c kernel: [ 1754.589765] RDX: ffff8dcc416d4a00 RSI: 0000000000000000 RDI: 031cb2f7f65ab9c0
Mar 22 07:44:14 c kernel: [ 1754.589766] RBP: ffffb3ed036e7ce8 R08: 0000000000000000 R09: 0000000000001000
Mar 22 07:44:14 c kernel: [ 1754.589767] R10: ffffffffba4530e8 R11: 000000000000000f R12: c73000dd96ae7534
Mar 22 07:44:14 c kernel: [ 1754.589768] R13: 0000000000000000 R14: 031cb2f7f65ab9c0 R15: ffff8dcc4daf00d0
Mar 22 07:44:14 c kernel: [ 1754.589769] FS:  0000000000000000(0000) GS:ffff8dd3d0440000(0000) knlGS:0000000000000000
Mar 22 07:44:14 c kernel: [ 1754.589771] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 22 07:44:14 c kernel: [ 1754.589772] CR2: 000000c000177000 CR3: 000000011a042000 CR4: 0000000000752ee0
Mar 22 07:44:14 c kernel: [ 1754.589773] PKRU: 55555554
Mar 22 07:44:14 c kernel: [ 1754.589774] Call Trace:
Mar 22 07:44:14 c kernel: [ 1754.589775]  <TASK>
Mar 22 07:44:14 c kernel: [ 1754.589777]  ? show_trace_log_lvl+0x1d6/0x2ea
Mar 22 07:44:14 c kernel: [ 1754.589780]  ? show_trace_log_lvl+0x1d6/0x2ea
Mar 22 07:44:14 c kernel: [ 1754.589782]  ? dma_direct_free+0xc3/0x150
Mar 22 07:44:14 c kernel: [ 1754.589785]  ? show_regs.part.0+0x23/0x29
Mar 22 07:44:14 c kernel: [ 1754.589787]  ? __die_body.cold+0x8/0xd
Mar 22 07:44:14 c kernel: [ 1754.589789]  ? die_addr+0x3e/0x60
Mar 22 07:44:14 c kernel: [ 1754.589791]  ? exc_general_protection+0x1c5/0x410
Mar 22 07:44:14 c kernel: [ 1754.589794]  ? asm_exc_general_protection+0x27/0x30
Mar 22 07:44:14 c kernel: [ 1754.589797]  ? __free_pages+0x13/0xc0
Mar 22 07:44:14 c kernel: [ 1754.589798]  ? dma_free_from_pool+0x61/0xa0
Mar 22 07:44:14 c kernel: [ 1754.589800]  dma_direct_free+0xc3/0x150
Mar 22 07:44:14 c kernel: [ 1754.589802]  dma_free_attrs+0x3c/0x60
Mar 22 07:44:14 c kernel: [ 1754.589804]  mlx5_frag_buf_free+0x60/0x80 [mlx5_core]
Mar 22 07:44:14 c kernel: [ 1754.589840]  qp_destroy+0xe4/0xf0 [mlx5_vdpa]
Mar 22 07:44:14 c kernel: [ 1754.589842]  teardown_vq.part.0+0xcd/0x120 [mlx5_vdpa]
Mar 22 07:44:14 c kernel: [ 1754.589844]  mlx5_cvq_kick_handler+0x4ca/0x510 [mlx5_vdpa]
Mar 22 07:44:14 c kernel: [ 1754.589846]  ? finish_task_switch.isra.0+0x70/0x280
Mar 22 07:44:14 c kernel: [ 1754.589848]  process_one_work+0x228/0x3d0
Mar 22 07:44:14 c kernel: [ 1754.589850]  worker_thread+0x53/0x420
Mar 22 07:44:14 c kernel: [ 1754.589851]  ? process_one_work+0x3d0/0x3d0
Mar 22 07:44:14 c kernel: [ 1754.589852]  kthread+0x127/0x150
Mar 22 07:44:14 c kernel: [ 1754.589854]  ? set_kthread_struct+0x50/0x50
Mar 22 07:44:14 c kernel: [ 1754.589856]  ret_from_fork+0x1f/0x30
Mar 22 07:44:14 c kernel: [ 1754.589858]  </TASK>
Mar 22 07:44:14 c kernel: [ 1754.589859] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock xt_multiport xt_set ipt_rpfilter ip_set_hash_ip ip_set_hash_net ip_set veth ipip tunnel4 ip_tunnel wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel xt_statistic xt_nat xt_mark xt_comment nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nft_counter nf_tables nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 cuse target_core_user uio target_core_mod nvme_fabrics snd_hda_codec_hdmi snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_hda_codec_realtek snd_soc_hdac_hda snd_hda_ext_core snd_hda_codec_generic snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus

Hello @liujiong63,

Thank you for posting your query on our community. To investigate the cause of the host crashes, we need to review your configuration and determine whether it is supported. I would therefore like to ask you to open a support ticket for further troubleshooting by emailing Networking-support@nvidia.com. Please upload sysinfo snapshots from the host, the DPU, and the VM for further debugging. The procedure to capture a sysinfo snapshot is described here: GitHub - Mellanox/linux-sysinfo-snapshot: Linux Sysinfo Snapshot
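Capturing the snapshot typically looks like the sketch below; please check the repository README for the exact invocation and options for your release:

```shell
# Fetch the snapshot tool and run it with root privileges
# (run on the host, the DPU, and inside the VM)
git clone https://github.com/Mellanox/linux-sysinfo-snapshot.git
cd linux-sysinfo-snapshot
sudo python3 sysinfo-snapshot.py
```

The tool writes a compressed archive (by default under /tmp) that you can attach to the support ticket.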

Thanks,
Bhargavi