To start my development, I'm trying to build a minimal example of Dynamically Connected (DC) transport QPs. One of the examples I'm using comes from NVIDIA itself: Dynamically Connected (DC) QPs - NVIDIA Docs
The example works on the DC target side, but on the initiator side I get an error when creating the QP with the mlx5dv_create_qp function.
The function fails with errno 5 (EIO, I/O error) on my ConnectX-5 hardware, and I see these lines in dmesg:
[4494044.006304] mlx5_core 0000:af:00.0: mlx5_cmd_out_err:797:(pid 619708): CREATE_QP(0x500) op_mod(0x0) failed, status bad input length(0x50), syndrome (0x2f50ca), err(-5)
[4494044.007456] infiniband mlx5_0: create_qp:3192:(pid 619708): Create QP type 4098 failed
There is no further explanation in the logs that might suggest the cause of the error. Am I missing something on the initiator side that the example does not cover? Thank you.
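For anyone reading the dmesg lines above, here is a minimal sketch of how the numbers can be decoded. The helper names (mlx5_qp_type_name, parse_cmd_err) are hypothetical and not part of any library. The QP type mapping is an assumption based on the Linux kernel headers as far as I know them (IB_QPT_RESERVED1 = 0x1000, with the mlx5 driver using the reserved values for its private QP types), so please check it against your kernel source. Under that assumption, type 4098 (0x1002) would be the DCI, which matches the failure being on the initiator side, and err(-5) is -EIO.

```c
#include <stdio.h>
#include <string.h>

/* Assumption: IB_QPT_RESERVED1 = 0x1000 in the kernel's ib_verbs.h, and
 * mlx5_ib maps RESERVED1..RESERVED4 to REG_UMR, HW_GSI, DCI, DCT.
 * Verify against the kernel version you are running. */
enum { QPT_RESERVED1 = 0x1000 };

/* Hypothetical helper: map the numeric "QP type" printed by mlx5_ib
 * ("Create QP type 4098 failed") to a readable name. */
static const char *mlx5_qp_type_name(unsigned int type)
{
    switch (type) {
    case QPT_RESERVED1 + 0: return "REG_UMR";
    case QPT_RESERVED1 + 1: return "HW_GSI";
    case QPT_RESERVED1 + 2: return "DCI";
    case QPT_RESERVED1 + 3: return "DCT";
    default:                return "unknown";
    }
}

/* Hypothetical helper: extract status, syndrome and err from an
 * mlx5_cmd_out_err line. Returns 0 on success, -1 if the expected
 * fields are not present. */
static int parse_cmd_err(const char *line, unsigned int *status,
                         unsigned int *syndrome, int *err)
{
    const char *p = strstr(line, "failed, status ");
    if (p == NULL)
        return -1;
    /* %*[^(] skips the textual status, e.g. "bad input length". */
    if (sscanf(p, "failed, status %*[^(](%x), syndrome (%x), err(%d)",
               status, syndrome, err) != 3)
        return -1;
    return 0;
}
```

Feeding the first dmesg line to parse_cmd_err gives status 0x50, syndrome 0x2f50ca and err -5. The syndrome is the value worth quoting when asking NVIDIA support, since its meaning is firmware-specific.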
That's certainly a step in the right direction, but I could only find PAS in the kernel driver code, so I wonder how this correlates with the user-land code.
It seems that the driver passed the firmware a buffer length parameter that is insufficient to establish the QP context.
This situation can arise if there is a compatibility issue between the driver and firmware versions. To verify this, please check the Release Notes of the driver, available at: Linux InfiniBand Drivers.
Improper parameter handling can also lead to this problem.
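If several nodes need checking, the firmware comparison can be scripted. Below is a rough sketch, assuming you already have the firmware string (for example the fw_ver field printed by ibv_devinfo) and the minimum version listed in the release notes for your driver. The struct and function names are hypothetical, and the code only compares dotted triples such as 16.32.1010; it does not know which versions are actually compatible.

```c
#include <stdio.h>

/* Hypothetical type: a firmware version of the form major.minor.sub. */
struct fw_version {
    unsigned int major, minor, sub;
};

/* Parse "16.32.1010" style strings. Returns 0 on success, -1 if the
 * string is not exactly three dot-separated unsigned numbers. */
static int fw_parse(const char *s, struct fw_version *v)
{
    char tail;
    /* The trailing %c only matches if extra characters follow, in which
     * case sscanf returns 4 and the string is rejected. */
    if (sscanf(s, "%u.%u.%u%c", &v->major, &v->minor, &v->sub, &tail) != 3)
        return -1;
    return 0;
}

/* Returns a negative, zero or positive value if a is older than,
 * equal to, or newer than b. */
static int fw_compare(const struct fw_version *a, const struct fw_version *b)
{
    if (a->major != b->major)
        return a->major < b->major ? -1 : 1;
    if (a->minor != b->minor)
        return a->minor < b->minor ? -1 : 1;
    if (a->sub != b->sub)
        return a->sub < b->sub ? -1 : 1;
    return 0;
}
```

A node would then be flagged when fw_compare(&installed, &required) is negative, where required is whatever minimum the release notes give for your driver.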
Thanks @chenh1, we did indeed have a firmware issue on some of our nodes. We repeated the test on two Dell PowerEdge systems that are properly configured (driver 5.6-2.0.9 and firmware 16.32.1010), and now dmesg only shows:
[4648076.498624] infiniband mlx5_0: create_qp:3192:(pid 4166615): Create QP type 4098 failed
But there is still a problem when creating the QP for the DCT. You mentioned improper parameter handling. Does this refer to the device/driver configuration or to the user code? I ask because the code was taken from the NVIDIA documentation.