When running an MPI application on Dell R6525 servers with ConnectX-6 SP OEM cards, I see MPI error messages like:
test: Rank 0:311: MPI_Init: ibv_modify_qp(rst2init) failed
test: Rank 0:311: MPI_Init: ibv_create_procqp() failed
The kernel log shows:
kernel: mlx5_core 0000:a1:00.0: mlx5_cmd_check:795:(pid 27817): MANAGE_PAGES(0x108) op_mod(0x2) failed, status bad system state(0x4), syndrome (0xe8912)
kernel: mlx5_core 0000:a1:00.0: reclaim_pages:437:(pid 27817): failed reclaiming pages: err -5
I found a workaround online that makes it work: setting the boot option "iommu=pt". But I don't think that fixes the root cause. Can anyone give me a hint on how to solve this issue in a better way?
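
In case it helps anyone reproduce this, here is how I check whether the workaround is actually active on a node after reboot. It is only a minimal sketch, assuming a Linux system where /proc/cmdline is readable; the function name iommu_mode is my own and not part of any tool.

```python
from pathlib import Path


def iommu_mode(cmdline: str):
    """Return the value of the last 'iommu=' option on a kernel command line.

    Returns None if no such option is present. Options such as
    'amd_iommu=on' are deliberately not matched.
    """
    mode = None
    for token in cmdline.split():
        if token.startswith("iommu="):
            # Keep the last occurrence if the option is given more than once.
            mode = token.split("=", 1)[1]
    return mode


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with.
    cmdline = Path("/proc/cmdline").read_text()
    print("iommu option:", iommu_mode(cmdline))
```

On a node booted with the workaround this should report pt; on a node without it, None.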