Issue with ConnectX-6 SP in Dell R6525 (AMD EPYC 7542/7742)

Hello Community,

When running an MPI application on Dell R6525 servers with ConnectX-6 SP OEM cards, I see MPI error messages like:

test: Rank 0:311: MPI_Init: ibv_modify_qp(rst2init) failed

test: Rank 0:311: MPI_Init: ibv_create_procqp() failed

The kernel log shows:

kernel: mlx5_core 0000:a1:00.0: mlx5_cmd_check:795:(pid 27817): MANAGE_PAGES(0x108) op_mod(0x2) failed, status bad system state(0x4), syndrome (0xe8912)

kernel: mlx5_core 0000:a1:00.0: reclaim_pages:437:(pid 27817): failed reclaiming pages: err -5

I found a workaround online that made it work: setting the boot option “iommu=pt”. But I don’t think that fixes the root cause. Can anyone give me a hint on how to solve this issue in a better way?
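
For anyone trying the same workaround, here is a minimal sketch for checking whether the option is actually active on the running kernel. It assumes a Linux system where /proc/cmdline is readable; the helper name iommu_mode is made up for illustration. How the option gets onto the command line depends on the distribution and bootloader (on GRUB-based systems it is typically added to the kernel command line in the GRUB defaults, followed by regenerating the config and rebooting).

```python
from pathlib import Path
from typing import Optional


def iommu_mode(cmdline: str) -> Optional[str]:
    """Return the value of the last 'iommu=' option in a kernel command line.

    Returns None if no such option is present. Options such as
    'amd_iommu=on' are deliberately not matched.
    """
    mode = None
    for token in cmdline.split():
        if token.startswith("iommu="):
            mode = token.split("=", 1)[1]
    return mode


if __name__ == "__main__":
    # Assumption: Linux exposes the boot parameters of the running kernel here.
    cmdline = Path("/proc/cmdline").read_text()
    mode = iommu_mode(cmdline)
    if mode == "pt":
        print("iommu=pt is active (IOMMU passthrough mode)")
    else:
        print(f"iommu=pt is NOT active (found: {mode!r})")
```

Run it on each node after rebooting; a node that still reports the option as not active was probably booted from an entry that was not regenerated.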

Thanks,

Christian

Hi Christian,

Setting iommu=pt is a prerequisite from AMD for running MPI tests on AMD EPYC.

I suggest referring to the AMD performance and power optimization guide.

Thanks,

Samer

Hi Samer,

Thanks for the information. I found it in the guide.

Best,

Christian