Kernel bypass (LD_PRELOAD) TCP/IP Socket send performance optimization

lawrence124 · December 2, 2024, 6:27am

We are profiling socket send time and found that “send” takes around 5-7 us for a packet size of 215 bytes.

Profiling code

#ifdef SEND_ASYNC_DEBUG
loge("SendNext, b4send, {}", tn.rdns() - starttime); // 4-150 max=1500ns
starttime = tn.rdns();
#endif

auto n = send(m_Socket, sendPtr, sendSize, MSG_NOSIGNAL);

#ifdef SEND_ASYNC_DEBUG
loge("SendNext, send, {}", tn.rdns() - starttime); // 3000-10000 max=20000ns
#endif

Start script

cd /opt/abc11 && ulimit -c unlimited && LD_PRELOAD="libvma.so /opt/omdc/libmimalloc.so.2.0" VMA_SPEC=latency VMA_MEM_ALLOC_TYPE=1 VMA_THREAD_MODE=1 VMA_RX_BYTES_MIN=2621440 VMA_RX_BUFS=2000000 VMA_INTERNAL_THREAD_AFFINITY=3  nice -n -20 /opt/abc11/abc /opt/abc11/abc.json

Using vma_stats, we confirmed that everything sent is already “offloaded”, so I assume the calls to

send(xxx,xxx,xxx,xxx)

go through the Mellanox library rather than the kernel socket path behind <sys/socket.h>.

Question

Are there any optimization techniques to reduce the socket transmission time?

We are using VMA_SPEC=latency, running on CentOS with a ConnectX-5.

From my research, a TCP/IP java.net.Socket send takes around 5-7 us, so I expected kernel bypass to do better (see “Java TCP/IP Socket write performance optimization” on Stack Overflow).

abirman · December 5, 2024, 10:40am

Hi,

Thanks for your questions.
A few points:

First, I would remove the call to send and check that the measurement itself does not add latency. The measured latency should be ~0 in this case.
The call to send should go to VMA in this case.
java.net.Socket has its price indeed. Inside, Java calls the send syscall or another write syscall, which VMA intercepts.
Afaik, kernel latency is several microseconds while VMA is around 1-2 us, depending on the setup. So if most of the latency comes from Java, a 1-2 us difference may not be visible and may fall within the 5-7 us fluctuation. You can run several tests with and without VMA to see the trend.
A deeper understanding of all the hardware/software components and the scenario will require a support case in the NVIDIA portal (you can send an email to enterprisesupport@nvidia.com and the case will be handled according to the entitlement).
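
The first and fourth points above can be sketched as a small C++ harness: time an empty region to get the cost of the timer itself, then compare percentiles over many samples rather than single readings. This is only a sketch under assumptions: measure_ns and percentile are hypothetical helper names, and std::chrono::steady_clock stands in for whatever tn.rdns() reads in the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: run `op` `iterations` times and return per-call nanoseconds.
// steady_clock is an assumption; the original code uses its own tn.rdns() timer.
template <typename Op>
std::vector<std::int64_t> measure_ns(Op&& op, std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    std::vector<std::int64_t> samples;
    samples.reserve(iterations);
    for (std::size_t i = 0; i < iterations; ++i) {
        const auto start = clock::now();
        op();
        const auto stop = clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
    }
    return samples;
}

// Hypothetical helper: simple percentile, the sample at index
// floor(p / 100 * (n - 1)) after sorting, with p in [0, 100].
std::int64_t percentile(std::vector<std::int64_t> samples, double p) {
    assert(!samples.empty() && p >= 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    const auto rank = static_cast<std::size_t>(p / 100.0 * (samples.size() - 1));
    return samples[rank];
}
```

Calling measure_ns([] {}, 100000) gives the timer-only baseline. Passing a lambda that wraps the real send(m_Socket, sendPtr, sendSize, MSG_NOSIGNAL) call gives the figure to compare, once with libvma preloaded and once without.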

Best Regards,
Anatoly

lawrence124 · January 11, 2025, 4:28am

@abirman

Thanks for the reply. We are now moving the dummy message approach from the testing phase to production, but we have a problem sending dummy messages from the production site to the real server.

auto n2 = send(m_Socket, sendPtr, sendSize, MSG_NOSIGNAL | VMA_SND_FLAGS_DUMMY);
auto n = send(m_Socket, sendPtr, sendSize, MSG_NOSIGNAL);

The first line always comes back with the error “Resource temporarily unavailable”, while the second line has no problem.

We verified the “dummy send” capability in HW using the VMA_TRACELEVEL=debug approach and confirmed QP=1.
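
“Resource temporarily unavailable” is the standard message for errno EAGAIN/EWOULDBLOCK, so it may help to capture errno for each of the two sends separately. The sketch below uses only POSIX calls; SendOutcome, classify_send and send_and_classify are hypothetical names, and it makes no claim about why the dummy send returns this error.

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical result type for one send attempt.
enum class SendOutcome { Sent, Partial, WouldBlock, Failed };

// Hypothetical helper: interpret the return value of send() together with errno.
// `saved_errno` must be captured right after send(), before another call overwrites errno.
SendOutcome classify_send(ssize_t n, std::size_t requested, int saved_errno) {
    if (n >= 0) {
        return static_cast<std::size_t>(n) == requested ? SendOutcome::Sent
                                                         : SendOutcome::Partial;
    }
    if (saved_errno == EAGAIN || saved_errno == EWOULDBLOCK) {
        return SendOutcome::WouldBlock;  // "Resource temporarily unavailable"
    }
    return SendOutcome::Failed;
}

// Send once and classify the result; `flags` is passed through unchanged.
SendOutcome send_and_classify(int fd, const void* buf, std::size_t len, int flags) {
    errno = 0;
    const ssize_t n = send(fd, buf, len, flags);
    const int saved_errno = errno;
    return classify_send(n, len, saved_errno);
}
```

Calling send_and_classify once with MSG_NOSIGNAL | VMA_SND_FLAGS_DUMMY and once with MSG_NOSIGNAL would show whether only the dummy call returns WouldBlock, which is useful detail to attach to a support case.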
