Issues syncing clock and reading clock timestamp at the same time

Hi,

I am using PTP to sync my RDMA HCAs. I have a variety of NICs (ConnectX-5, ConnectX-6, ConnectX-7, and BlueField-3) and use both InfiniBand and RoCE.

At the same time, I am also timestamping the CQEs to record when messages are received by my RDMA application. Occasionally, this seems to put the QP into a bad state where the completion queue blocks indefinitely. I believe this is due to a race condition where the HCA clock is being written and read at the same time. I see this issue even when CLOCK_REALTIME_ENABLE is off or not available, which means the driver should be handling clock reads and writes, so there is all the more reason to expect no issues here. In any case, any modern NIC should be able to handle simultaneous reads and writes of its clock, so I assume Mellanox NICs can do so as well.

This leads me to believe that this is not expected behavior but is in fact a bug. Has it been addressed in any release of MLNX_OFED, DOCA, or the ConnectX-5/6/7/BlueField-3 firmware, or are there plans to address it? Are there any workarounds to avoid this issue?

Thanks.

Hi,

Thanks for your question.
To assist you with this scenario, we’ll need the exact reproduction steps and the relevant logs from each scenario so that we can investigate.
To do this, please open a support case in the NVIDIA support portal or send an email to enterprisesupport@nvidia.com. The case will be handled according to your support entitlement.

Thanks,
Anatoly

Are you referring to the HCA core clock or to PTP?
PTP only syncs the PHC (PTP Hardware Clock), while CQ timestamping is done via the HCA core clock.
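
To illustrate the distinction: a raw CQ timestamp taken from the HCA core clock is a tick count, not a wall-clock time, so it has to be scaled by the core clock frequency before it can be compared with anything else. Below is a minimal sketch of that scaling in C. It assumes the frequency is available in kHz (as I understand the `hca_core_clock` field returned by `ibv_query_device_ex` to be) and that the tick count comes from something like `ibv_wc_read_completion_ts`. The helper name `ticks_to_ns` is made up for this example and is not part of any library.

```c
#include <stdint.h>

/*
 * Convert an interval measured in free-running core clock ticks to
 * nanoseconds. core_clock_khz is ASSUMED to be the device core clock
 * frequency in kHz. Returns 0 if the frequency is unknown (0).
 */
uint64_t ticks_to_ns(uint64_t ticks, uint64_t core_clock_khz)
{
    if (core_clock_khz == 0)
        return 0;

    uint64_t hz   = core_clock_khz * 1000ULL; /* ticks per second */
    uint64_t secs = ticks / hz;               /* whole seconds */
    uint64_t rem  = ticks % hz;               /* leftover ticks, < hz */

    /* rem / hz seconds == rem * 1e6 / kHz nanoseconds. Splitting off
     * whole seconds first keeps the multiplication from overflowing
     * for realistic clock frequencies. */
    return secs * 1000000000ULL + (rem * 1000000ULL) / core_clock_khz;
}
```

For example, with a 156.25 MHz core clock (156250 kHz), `ticks_to_ns(156250, 156250)` gives 1000000, i.e. 1 ms. The result is only an elapsed time relative to the free-running counter. It is not aligned to the PTP-disciplined PHC unless you correlate the two clocks separately.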