hang on cm_destroy_id when NIC is down using the MLNX-OFED 4.7-

Hi, I am using the newest MLNX-OFED. When the NIC goes down, we stop our process with kill -9, but the process hangs in cm_destroy_id. Does anyone have an idea what causes this?

#0 [ffffb48b2044fa30] __schedule at ffffffffb789d270
#1 [ffffb48b2044fab8] schedule at ffffffffb789d882
#2 [ffffb48b2044fac8] schedule_timeout at ffffffffb78a1622
#3 [ffffb48b2044fb50] wait_for_completion at ffffffffb789eecf
#4 [ffffb48b2044fbb0] cm_destroy_id at ffffffffc05c7ee4 [ib_cm]
#5 [ffffb48b2044fc10] rdma_destroy_id at ffffffffc060c9f0 [rdma_cm]
#6 [ffffb48b2044fc38] ucma_close at ffffffffc0620fc8 [rdma_ucm]
#7 [ffffb48b2044fc60] __fput at ffffffffb727dff0
#8 [ffffb48b2044fca0] task_work_run at ffffffffb70c75a2
#9 [ffffb48b2044fcd0] do_exit at ffffffffb70ae006
#10 [ffffb48b2044fd40] do_group_exit at ffffffffb70ae959
#11 [ffffb48b2044fd68] get_signal at ffffffffb70b9a20
#12 [ffffb48b2044fdf0] do_signal at ffffffffb7024566
#13 [ffffb48b2044ff08] exit_to_usermode_loop at ffffffffb7003c6f
#14 [ffffb48b2044ff30] do_syscall_64 at ffffffffb7003872
#15 [ffffb48b2044ff50] entry_SYSCALL_64_after_hwframe at ffffffffb7a00081
RIP: 00007f9ac82e2603 RSP: 00007f927e3a5a10 RFLAGS: 00000293
RAX: 0000000000000004 RBX: 00007f927e3a5a50 RCX: 00007f9ac82e2603
RDX: 0000000000000020 RSI: 00007f927e3a5a50 RDI: 0000000000000028
RBP: 000000000a8d4320 R8: 0000000000000000 R9: 0000000008e0cc00
R10: 0000000000000000 R11: 0000000000000293 R12: 000000000a8e6580
R13: 00007f927e3a6c20 R14: 000a7ce401fc123d R15: 000000000a7d1e40
ORIG_RAX: 00000000000000e8 CS: 0033 SS: 002b

What is the result if you

  • power on the host

  • Verify that no RDMA-related processes or services are running (including yours)

  • systemctl restart openibd

Does it hang, or are you able to reload the driver? If you are able to reload it, then something in your application might be broken, and the way the application is killed needs to be reviewed. It could be that after the application is killed, it still holds some HCA resources.
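
One way to do the "no RDMA-related process is running" check before restarting openibd is to look for processes that still hold an RDMA device node open. The script below is only a rough sketch, not an official tool: the function name find_rdma_users is made up for illustration, and it assumes the device nodes live under /dev/infiniband/ (the usual udev location for rdma_cm and uverbsN). Run it as root, otherwise processes owned by other users cannot be inspected and are silently skipped.

```python
import os


def find_rdma_users(proc_root="/proc", dev_prefix="/dev/infiniband/"):
    """Return {pid: [device paths]} for processes holding RDMA device nodes open.

    proc_root and dev_prefix are parameters only so the scan can be pointed
    at a different tree; the defaults are assumptions about a typical system.
    """
    users = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process already exited, or permission denied
        held = set()
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed while we were scanning
            if target.startswith(dev_prefix):
                held.add(target)
        if held:
            users[int(entry)] = sorted(held)
    return users


if __name__ == "__main__":
    # Print one line per process: PID followed by the device nodes it holds.
    for pid, devices in sorted(find_rdma_users().items()):
        print(pid, " ".join(devices))
```

If the script prints nothing, no process it could inspect has an RDMA device open, and systemctl restart openibd can be tried. A task that is already stuck while exiting may no longer list any open descriptors, so an empty result is only meaningful after a fresh boot, as in the steps above.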