I haven’t been able to find much on the Remote_Invalid_Request_Error, but one page on the web (rdmamojo) suggests it can be caused by the qp_access_flags on the remote QP not being configured to support the operation, insufficient buffering to receive a new RDMA or Atomic Operation request, or an invalid length specified in an RDMA request.
I have validated all of the above: qp_access_flags on the responder has RDMA Read enabled, there is enough buffering to receive the RDMA Read, and the length specified in the RDMA Read request on the requester is fine. In addition, I have validated the remote addr/rkey, the local addr/lkey, the length, and the entire posted WQE; they all look fine.
Any idea what else could cause this error (Remote_Invalid_Request_Error)? Also, I see a vendor syndrome of 0x8a in the completion but couldn’t find any details on it; is there a way to decode it for further information on the failure?
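For reference, here is roughly how I pull the status and vendor syndrome out of the work completion (standard libibverbs; the “cq” handle is from my own setup):

    #include <stdio.h>
    #include <infiniband/verbs.h>

    static void drain_cq(struct ibv_cq *cq)
    {
        struct ibv_wc wc;

        while (ibv_poll_cq(cq, 1, &wc) > 0) {
            if (wc.status != IBV_WC_SUCCESS) {
                /* IBV_WC_REM_INV_REQ_ERR is the verbs status for
                   Remote_Invalid_Request_Error; wc.vendor_err carries the
                   opaque vendor syndrome (0x8a in my case). */
                fprintf(stderr, "wr_id %llu failed: %s (vendor syndrome 0x%x)\n",
                        (unsigned long long)wc.wr_id,
                        ibv_wc_status_str(wc.status), wc.vendor_err);
            }
        }
    }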
I’m having a similar problem with the RHEL inbox drivers and NFS over RDMA between some newly installed machines; I’m getting Local Length Errors. While I don’t have a solution for you specifically, I’m wondering how you did the validation you mention (qp_access_flags, buffer sizes, etc.). Some of the things you checked might help me with my problem.
Basically, by validating (qp_access_flags, buffer sizes, etc.) I mean I made sure they had the right values. For example, qp_access_flags was enabled for RDMA Read and Write, the buffer size in the work request matched the one used for memory registration, and so on. A minimal sketch of the access-flags check follows.
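This is roughly the kind of check I did on the responder side: query the QP and confirm remote access is actually enabled (plain libibverbs; “qp” is the connected QP from my application):

    #include <stdio.h>
    #include <infiniband/verbs.h>

    static int check_access_flags(struct ibv_qp *qp)
    {
        struct ibv_qp_attr attr;
        struct ibv_qp_init_attr init_attr;

        /* Ask the driver for the current access flags on this QP. */
        if (ibv_query_qp(qp, &attr, IBV_QP_ACCESS_FLAGS, &init_attr))
            return -1;

        printf("qp_access_flags: remote read %s, remote write %s\n",
               (attr.qp_access_flags & IBV_ACCESS_REMOTE_READ)  ? "on" : "off",
               (attr.qp_access_flags & IBV_ACCESS_REMOTE_WRITE) ? "on" : "off");
        return 0;
    }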
I was able to get this issue resolved. The problem was with the “max_dest_rd_atomic” QP attribute. Per the documentation, “max_dest_rd_atomic” is the “number of RDMA Reads outstanding at any time for this QP as a destination”. Our code was using RDMACM for connection management, and RDMACM sets “max_dest_rd_atomic” from the “responder_resources” field of the “rdma_conn_param” argument to “rdma_connect”. That field was not obvious to us, so we left it unset, which caused RDMACM to set “max_dest_rd_atomic” to zero and made RDMA Reads targeting this node fail. A sketch of the fix is below.
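Roughly, the fix looks like this: fill in “responder_resources” (and “initiator_depth”) in “rdma_conn_param” before calling “rdma_connect” (“id” is the connected rdma_cm_id from our application; the value 4 and the retry counts are just example settings, not anything mandated):

    #include <string.h>
    #include <rdma/rdma_cma.h>

    static int connect_with_rd_atomic(struct rdma_cm_id *id)
    {
        struct rdma_conn_param conn_param;

        memset(&conn_param, 0, sizeof(conn_param));
        /* RDMA Reads this side can serve as a destination; RDMACM uses
           this to program max_dest_rd_atomic instead of leaving it zero. */
        conn_param.responder_resources = 4;
        /* RDMA Reads this side may have outstanding as an initiator. */
        conn_param.initiator_depth     = 4;
        conn_param.retry_count         = 7;
        conn_param.rnr_retry_count     = 7;

        return rdma_connect(id, &conn_param);
    }

Note that the passive side takes the same “rdma_conn_param” structure in “rdma_accept”, so if you accept connections you likely need to set “responder_resources” there as well.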
Basically, the “Remote_Invalid_Request_Error” syndrome covers many issues that are not clearly distinguished, which is why it took us time to figure out the exact problem. This is where I was hoping the “vendor syndrome” would come in handy for finding the root cause of “Remote_Invalid_Request_Error” or similar errors that have multiple failure reasons. Unfortunately, the “vendor syndrome” values don’t seem to be documented by Mellanox. It would help if Mellanox published these codes with their descriptions, so that Mellanox RDMA users could debug similar issues.