I wanted to ask a couple of questions about using Core-Direct/cross channel communication. The documentation I’ve found is a little sparse, and I was hoping you could fill in a couple of gaps. Thus far, I’ve been able to get a simple experiment up and running when using a queue pair with managed sends, but when using both managed sends and managed receives I get the strange error described in the fourth question below. I think it might be related to parts of the API that I don’t fully understand.
- What is the purpose of the IBV_EXP_SEND_WAIT_EN_LAST flag? Is it used with SEND_ENABLE/RECV_ENABLE, CQE_WAIT, or some combination? Does it differ depending on whether I post sends using ibv_exp_post_send or ibv_exp_post_task?
- The enable and wait opcodes include a “count” field that indexes into work requests. Is the first work request/completion always numbered 1, with subsequent ones linearly increasing? Or does the numbering sometimes reset, for example between ibv_exp_post_task() calls, or decrease after some requests have been handled? Are the numbers consistent between ibv_exp_post_send and ibv_exp_post_task?
- How do the CALC operations work, and what is their intended purpose? Specifically, where are they sourcing their inputs from and writing their outputs to?
- Do I have to adjust the code that detects errors when polling for completions? In my case, when polling the manager QP’s CQ, I’m seeing ibv_poll_cq return opcodes that don’t match the operations I posted. By tracking wr_ids I can see that when an IBV_EXP_WR_CQE_WAIT fails, the opcode reported for it is RDMA_WRITE. Relatedly, what would getting an IBV_WC_LOC_QP_OP_ERR (2) error mean in this context?
- Should it be possible to use cross channel communication with RoCE? Both my own code and the example code provided with the Mellanox OFED distribution get an error about the network being unreachable when attempting to create the “loopback” master QP.