ConnectX-5 Ex VPI drops packets while performing RDMA write.

Hello,

I am working on a RoCE IP core for an FPGA and I have a problem with dropped packets. The adapter used in this project is a ConnectX-5 Ex VPI.

I’m trying to perform an RDMA write operation, and it works for transfers smaller than 256 kB. When I try to transfer more, e.g. 512 kB, the adapter receives only 66 packets (this number is constant across multiple tries) and the rest are (I assume) dropped. This number is taken from the available hardware counters, which also show that no error occurred. This happens when I send one packet after another without any pause between them.

If I insert a pause between subsequent packets, it gets better, and with a long enough pause the transfer eventually succeeds. Unfortunately, this results in a throughput of about 5 % of the bandwidth. I have a few ideas about what could cause this kind of behaviour:

  • The adapter is not ready to work at full speed, which seems not very likely to me.
  • The PCIe transaction takes too long to initialize, which results in overflow of the internal buffers. I tried placing a pause only between the first packet (as it contains the address and length) and the second one, but with the same result.
  • The adapter contains buffers which for some reason overflow. Maybe it waits for some kind of trigger before it starts the PCIe transaction, for example a threshold (set too high) that triggers the transaction only when there is enough data in the buffer.
  • Flow control is needed for correct functioning. This seems plausible to me, but not for such a small amount of data.
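
As a rough check on the 5 % figure: if every packet is followed by an idle gap, the achieved fraction of line rate is the packet time divided by the packet time plus the gap. The sketch below uses hypothetical function names and assumes equal-sized packets and a fixed gap, neither of which is stated above. Under those assumptions, 5 % corresponds to a gap of roughly 19 packet times.

```python
def throughput_fraction(packet_time: float, gap_time: float) -> float:
    """Fraction of line rate achieved when an idle gap follows every packet.

    Assumption: all packets take the same time on the wire and the gap is fixed.
    """
    return packet_time / (packet_time + gap_time)


def gap_for_fraction(packet_time: float, fraction: float) -> float:
    """Idle gap per packet that yields the given fraction of line rate."""
    return packet_time * (1.0 - fraction) / fraction


# With the packet time normalised to 1, a 5 % throughput means the link
# is idle for about 19 packet times after each packet.
gap = gap_for_fraction(1.0, 0.05)
print(round(gap, 6), round(throughput_fraction(1.0, gap), 6))
```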

I would really appreciate help in this matter, as I know nothing about the internal architecture of the adapter.

Thank you very much.

Hi Jakub,

I have reviewed your questions, and for further investigation I suggest opening a support case at

networking-support@nvidia.com

According to our records, your account has a support contract that has expired.

In order to renew the support contract, please contact Networking-contracts@nvidia.com.

Thanks,

Samer

Hi Samer,

I have already found the problem. It was the absence of any kind of flow control mechanism. It works fine now.

Thank you,

Jakub
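
The thread does not say which flow control mechanism fixed the issue, so the following is only a toy model of the general effect, not a description of the adapter. It assumes a receiver with a fixed number of buffer slots that drains more slowly than the sender transmits, and a pause signal that reaches the sender with no delay. All names and numbers are hypothetical.

```python
def simulate(total_packets: int, buffer_slots: int, drain_every: int,
             flow_control: bool) -> tuple[int, int, int]:
    """Toy model of a fast sender feeding a slower receiver buffer.

    The sender offers one packet per tick; the receiver drains one packet
    every `drain_every` ticks. Returns (delivered, dropped, ticks).
    Assumption: the pause signal takes effect on the very next tick. With
    real signalling latency the pause threshold would have to sit below
    the buffer size.
    """
    if buffer_slots < 1 or drain_every < 1:
        raise ValueError("buffer_slots and drain_every must be >= 1")

    queued = sent = delivered = dropped = ticks = 0
    paused = False
    while delivered + dropped < total_packets:
        ticks += 1
        # Sender side: transmit unless paused or finished.
        if sent < total_packets and not paused:
            sent += 1
            if queued < buffer_slots:
                queued += 1
            else:
                dropped += 1  # buffer full, packet is lost
        # Receiver side: drain one packet every `drain_every` ticks.
        if ticks % drain_every == 0 and queued > 0:
            queued -= 1
            delivered += 1
        # Flow control: pause the sender while the buffer is full.
        if flow_control:
            paused = queued >= buffer_slots
    return delivered, dropped, ticks


print("no flow control:  ", simulate(512, 64, 4, flow_control=False))
print("with flow control:", simulate(512, 64, 4, flow_control=True))
```

In this model the run without flow control loses packets once the buffer fills, while the run with flow control delivers all of them at the receiver's drain rate, which is the same qualitative behaviour as described in the thread.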