Hi,
I am using PTP to synchronize the clocks on my RDMA HCAs. I have a variety of NICs (ConnectX-5, ConnectX-6, ConnectX-7, and BlueField-3), and I use both InfiniBand and RoCE.
At the same time, I am timestamping the CQEs to read the time at which messages are received by my RDMA application. Occasionally, this seems to put the QP into a bad state where polling the completion queue blocks indefinitely. I believe this is due to a race condition in which the HCA clock is written (by PTP) and read (for the CQE timestamp) at the same time. I see the issue even when CLOCK_REALTIME_ENABLE is off or not available, in which case the driver should be handling clock reads and writes, which is all the more reason there should be no conflict. In any case, any modern NIC should be able to handle concurrent reads and writes of its clock, so I assume Mellanox NICs can as well.
This leads me to believe that this is not expected behavior but a bug. Has it been addressed in any release of MLNX_OFED, DOCA, or the ConnectX-5/6/7 or BlueField-3 firmware, or are there plans to address it? Are there any workarounds in the meantime?
Thanks.