Measuring RDMA latency from host software to ConnectX-4 egress

I am running RDMA traffic between two Ubuntu hosts equipped with ConnectX-4 Lx cards. The end-to-end one-way latency is of particular interest to us, and the number we see is worse than we would like. I am trying to identify the bottleneck by measuring the latency of each component along the path. Is there any tool that measures the latency introduced by the ConnectX-4 itself, that is, the interval between the time the packet is generated by the host software and the time it egresses from the rNIC? Thanks.