Does MLNX have any tools or methods to measure the latency of the NIC DMA writing data to host memory, and the latency from the NIC raising an interrupt to a core servicing that interrupt?

In my testing, a 100Gb NIC (MT28800 Family, ConnectX-5 Ex) was used to connect the tester client(s) to the SUT, as shown below:
Case A: [tester0] ↔ [socket0 SUT(1P)]
Case B: [tester0] ↔ [socket0 SUT(2P) socket1] ↔ [tester1]
For Case A, performance looks good. For Case B, running two copies of the test on the 2P SUT did not double throughput as expected; it was only slightly higher than Case A.
While the tests were running, ethtool -S was sampled on the SUT, and the counters tx_pause_ctrl_phy and rx_discards_phy showed large increases.
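In case it helps reproduce the observation, the counters were sampled roughly like this (ens1f0 is a placeholder for the actual mlx5 interface name):

    # sample the PHY pause/discard counters once per second
    watch -n 1 "ethtool -S ens1f0 | grep -E 'tx_pause_ctrl_phy|rx_discards_phy'"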
Each copy of the test program was pinned to its local socket node with numactl, and the irqbalance service was stopped. I suspect that some kernel threads or network-stack data/buffers end up on the remote socket node, causing longer latency and packet loss. Does MLNX have any tools or methods to measure the latency of the NIC DMA writing data to memory, and the latency from the NIC raising an interrupt to a core servicing that interrupt?
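
For reference, the pinning and NUMA checks were done roughly as follows (interface names, node numbers, and the ./test_client binary with its --iface flag are placeholders for my actual setup):

    # stop irqbalance so IRQ affinities stay where they are set
    systemctl stop irqbalance

    # confirm which NUMA node each NIC is attached to (interface names are placeholders)
    cat /sys/class/net/ens1f0/device/numa_node
    cat /sys/class/net/ens2f0/device/numa_node

    # pin each test copy's CPUs and memory to the NIC-local socket
    # (./test_client and --iface are placeholders for the actual test program)
    numactl --cpunodebind=0 --membind=0 ./test_client --iface ens1f0 &
    numactl --cpunodebind=1 --membind=1 ./test_client --iface ens2f0 &

    # verify that the mlx5 IRQs are actually serviced by socket-local cores
    grep mlx5 /proc/interrupts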