Kernel bypass (LD_PRELOAD) TCP/IP socket send performance optimization

We are profiling socket send time and found that each “send” call takes around 5-7 us for a packet size of 215 bytes.

Profiling code

#ifdef SEND_ASYNC_DEBUG
    // Time elapsed since the previous timestamp, logged just before send().
    loge("SendNext, b4send, {}", tn.rdns() - starttime); // 4-150 max=1500ns
    starttime = tn.rdns();
#endif

    auto n = send(m_Socket, sendPtr, sendSize, MSG_NOSIGNAL);

#ifdef SEND_ASYNC_DEBUG
    // Time spent inside send() itself.
    loge("SendNext, send, {}", tn.rdns() - starttime); // 3000-10000 max=20000ns
#endif

Start script

cd /opt/abc11 && ulimit -c unlimited && LD_PRELOAD="libvma.so /opt/omdc/libmimalloc.so.2.0" VMA_SPEC=latency VMA_MEM_ALLOC_TYPE=1 VMA_THREAD_MODE=1 VMA_RX_BYTES_MIN=2621440 VMA_RX_BUFS=2000000 VMA_INTERNAL_THREAD_AFFINITY=3  nice -n -20 /opt/abc11/abc /opt/abc11/abc.json

Using vma_stats, we confirmed that everything sent is already “offloaded”, so I assume the calls to

send(xxx,xxx,xxx,xxx)

go through the Mellanox library rather than the kernel implementation declared in <sys/socket.h>.
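One way to check that assumption from inside the process is to ask the dynamic linker which shared object provides the send symbol. The sketch below is only an illustration: providing_library is a hypothetical helper name, it relies on the glibc dlsym and dladdr calls, and it reports where the symbol resolves by default, not what any particular call site was bound to.

```cpp
#include <dlfcn.h>
#include <string>

// Hypothetical helper (not part of VMA): returns the path of the shared
// object that provides `symbol` in the default lookup order, or an empty
// string if the symbol cannot be resolved. Older glibc may need -ldl.
inline std::string providing_library(const char* symbol) {
    // RTLD_DEFAULT searches the executable, then LD_PRELOAD libraries,
    // then the remaining dependencies, in that order.
    void* addr = dlsym(RTLD_DEFAULT, symbol);
    if (addr == nullptr) {
        return {};
    }
    Dl_info info{};
    if (dladdr(addr, &info) == 0 || info.dli_fname == nullptr) {
        return {};
    }
    return info.dli_fname;
}
```

Logging providing_library("send") once at startup should show a path ending in libvma.so when the preload is in effect, and the libc path otherwise.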

Question

  1. Any optimization techniques to reduce the socket transmission time?

We are using VMA_SPEC=latency and running on CentOS with a ConnectX-5.

From my research, a TCP/IP java.net.Socket send takes around 5-7 us, so I expected kernel bypass to do better (Java TCP/IP Socket write performance optimization - Stack Overflow).

Hi,

Thanks for your questions.
A few points:

  1. First, I would remove the call to send and check that the measurement itself does not add latency. The measured latency should be ~0 in that case.
  2. The call to send should go to VMA in this case.
  3. java.net.Socket does have its price. Internally, Java calls the send syscall or another write syscall, which VMA intercepts.
  4. AFAIK, kernel latency is several microseconds, while VMA is around 1-2 microseconds depending on the setup. So if most of the latency comes from Java, a 1-2 microsecond difference may not be visible and may fall within the 5-7 us fluctuation. You can run several tests with and without VMA to see the trend.
  5. A deeper understanding of all the hardware/software components and the scenario will require a support case in the Nvidia portal (you can send an email to enterprisesupport@nvidia.com and the case will be handled according to the entitlement).
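Points 1 and 4 can be sketched as a small timing harness. This is a minimal sketch, not the poster's profiler: time_calls, summarize and LatencyStats are hypothetical names, and std::chrono::steady_clock stands in for the tn.rdns() timer used in the question.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Summary of a set of latency samples, in nanoseconds.
struct LatencyStats {
    std::int64_t min_ns;
    std::int64_t median_ns;
    std::int64_t max_ns;
};

// Reduce raw samples to min / median / max. Takes a copy because it sorts.
inline LatencyStats summarize(std::vector<std::int64_t> samples) {
    if (samples.empty()) {
        return {0, 0, 0};
    }
    std::sort(samples.begin(), samples.end());
    return {samples.front(), samples[samples.size() / 2], samples.back()};
}

// Time `op` once per iteration and return the per-call durations.
// Passing an empty callable measures the cost of the timestamps alone,
// which is the "remove the call to send" baseline from point 1.
template <typename Op>
std::vector<std::int64_t> time_calls(std::size_t iterations, Op op) {
    std::vector<std::int64_t> samples;
    samples.reserve(iterations);
    for (std::size_t i = 0; i < iterations; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        op();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    }
    return samples;
}
```

Calling summarize(time_calls(100000, [] {})) gives the baseline; passing a lambda that wraps the real send call gives the figure to compare, once with and once without LD_PRELOAD. Comparing medians rather than single samples makes a 1-2 us shift easier to see under the fluctuation mentioned in point 4.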

Best Regards,
Anatoly

@abirman

Thanks for the reply. We are now moving the dummy message approach from the testing phase to production, but we have a problem sending dummy messages from the production site to the real server.

auto n2 = send(m_Socket, sendPtr, sendSize, MSG_NOSIGNAL | VMA_SND_FLAGS_DUMMY);
auto n = send(m_Socket, sendPtr, sendSize, MSG_NOSIGNAL);

The first line always comes back with the error “Resource temporarily unavailable”, while the second line has no problem.
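“Resource temporarily unavailable” is the message text for errno EAGAIN, so it can help to log that case separately from hard failures. The sketch below is a hypothetical wrapper (try_send and SendOutcome are illustrative names, not VMA API) that only classifies the result of send; it does not explain why the dummy send is rejected, and it leaves out VMA_SND_FLAGS_DUMMY so that it compiles without vma_extra.h.

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>

// Outcome of a single send() attempt.
enum class SendOutcome { Sent, WouldBlock, Failed };

// Hypothetical helper: calls send() and separates the transient
// EAGAIN / EWOULDBLOCK case from every other error.
// If `sent` is non-null it receives the raw return value of send().
inline SendOutcome try_send(int fd, const void* buf, std::size_t len,
                            int flags, ssize_t* sent = nullptr) {
    const ssize_t n = ::send(fd, buf, len, flags);
    if (sent != nullptr) {
        *sent = n;
    }
    if (n >= 0) {
        return SendOutcome::Sent;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        return SendOutcome::WouldBlock;  // "Resource temporarily unavailable"
    }
    return SendOutcome::Failed;  // inspect errno for the specific cause
}
```

Calling try_send with the same flags as the two lines above and logging the outcome per call would show whether the dummy send fails with EAGAIN every time or only under certain conditions.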

  1. We verified the “dummy send” capability in HW using the VMA_TRACELEVEL=debug approach and confirmed QP=1.