Performance issues with kernel space drivers

My team is experiencing performance issues with the RDMA drivers included in the Linux kernel.

We are trying to write a kernel module that uses RDMA features.

The problem is the send-recv latency between the two nodes when running in kernel space. Latency is significantly worse in our kernel module, while user-space programs do not show this problem.

(We measured user-space performance by running IB benchmarks such as ib_send_lat and ib_send_bw. With these, we simply reach the maximum throughput our hardware supports.)

Is there any way to improve RDMA performance in our kernel module?

Most likely you need to find where the bottleneck is, using a profiler or performance monitor. The linux-rdma mailing list might be a better place to ask generic questions like this.
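
Before profiling, it may help to put the kernel-side numbers on the same footing as the user-space ones. Here is a minimal sketch, in plain user-space C, that summarizes a set of per-iteration latency samples (in nanoseconds) into min, median, 99th percentile and max. The names `lat_summary` and `summarize_latency` are made up for this illustration, and it assumes you have already exported the samples from your module (for example, timestamps taken with `ktime_get_ns()` around each send-recv round trip); how you collect and export them is up to you.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical summary of latency samples, all values in nanoseconds. */
struct lat_summary {
    uint64_t min;
    uint64_t median;
    uint64_t p99;
    uint64_t max;
};

/* Three-way comparison for qsort(); avoids overflow from subtraction. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/*
 * Sorts `samples` in place and fills `*out`.
 * Returns 0 on success, -1 on empty or NULL input.
 * Percentiles use a simple nearest-rank index, which is an
 * assumption of this sketch, not how any particular tool does it.
 */
static int summarize_latency(uint64_t *samples, size_t n,
                             struct lat_summary *out)
{
    if (samples == NULL || out == NULL || n == 0)
        return -1;

    qsort(samples, n, sizeof(samples[0]), cmp_u64);

    out->min = samples[0];
    out->median = samples[n / 2];
    out->p99 = samples[(n - 1) * 99 / 100];
    out->max = samples[n - 1];
    return 0;
}
```

If the median is close to the user-space result but the 99th percentile or max is far off, the slowdown is likely in occasional stalls rather than in the common path, which narrows down where to point the profiler.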