Slow Ethernet response time with Nano

Hi experts,
we are having big issues getting a reliable response time for TCP communication between a server PC (Windows 10) and a Nano production module (with Auvidea JN30 carrier board, JetPack 4.3). In a test application we transmit a message of a certain size in a loop every 100 ms to the Nano device. The Nano application immediately sends this message back to the server.
The time from message transmitted to response received is measured on the server PC:

#Response time for message sent to Jetson Ubuntu client
#message size is in bytes (payload)
size:1 min: 0.94ms max: 1.95ms median: 0.97ms mean: 0.97ms
size:2 min: 0.93ms max: 0.99ms median: 0.96ms mean: 0.82ms
size:4 min: 0.94ms max: 0.98ms median: 0.98ms mean: 0.97ms
size:8 min: 0.92ms max: 0.99ms median: 0.97ms mean: 0.87ms
size:16 min: 0.93ms max: 0.99ms median: 0.98ms mean: 0.97ms
size:32 min: 0.93ms max: 1.00ms median: 0.97ms mean: 0.97ms
size:64 min: 6.78ms max: 371.88ms median: 125.87ms mean: 149.18ms
size:128 min: 14.63ms max: 407.16ms median: 126.03ms mean: 152.62ms
size:256 min: 0.98ms max: 500.45ms median: 51.90ms mean: 112.44ms
size:512 min: 0.95ms max: 508.51ms median: 98.84ms mean: 138.57ms
size:1024 min: 0.92ms max: 560.86ms median: 115.71ms mean: 167.44ms
size:2048 min: 0.96ms max: 637.18ms median: 109.20ms mean: 198.76ms
size:54 min: 0.94ms max: 0.98ms median: 0.98ms mean: 0.97ms
size:55 min: 0.93ms max: 0.99ms median: 0.98ms mean: 0.97ms
size:56 min: 0.93ms max: 1.00ms median: 0.98ms mean: 0.98ms
size:57 min: 0.95ms max: 1.00ms median: 0.98ms mean: 0.97ms
size:58 min: 28.32ms max: 424.25ms median: 110.82ms mean: 146.71ms
size:59 min: 0.93ms max: 396.38ms median: 130.84ms mean: 151.46ms

So a message size of 58 bytes (and bigger) causes an issue with the response times.
For testing we replaced the Nano production device, first with a different Ubuntu PC and afterwards with a Jetson Nano developer board as receiver/responder for the messages. In these cases the big variation of response times does not exist; the response time is usually around 1-2 ms.
What can cause such behaviour? Is there a magic Ethernet setting in L4T to solve our problem? Could it be hardware related?
Happy to hear about your suggestions :)

I do not have an answer, but do have some information you might find useful for this.

On a Jetson (or anything Linux), you can run “ifconfig” and find the “mtu” of a given interface. The default is 1500 bytes on Linux, but I don’t know about Windows. You may want to read up on MTU.

How this affects your system may depend on the protocol and settings of every hop along the route. If there is a switch in between, then the Jetson, switch, and Windows box settings are all relevant. Even if you connect directly, the Windows box and Jetson qualify as hops along the route. Managed switches or high-end switches may offer more settings in regard to MTU/MRU.

Often (protocols matter) there is a buffer of a given size, and until the buffer fills up (relative to the MTU), transmission will be delayed. If there isn’t enough data but a timer expires, then the ethernet port will go ahead and send. Hitting the MTU will instantly result in a send (and beyond the MTU the data will fragment and be sent in multiple packets).

To illustrate, under TCP (not ICMP, which is what ping uses): if you were typing a letter in a console and hitting the enter key at the end of each small line of text, then until you type enough lines to fill 1500 bytes (which includes overhead, so the actual text typed would be less than 1500 bytes), you might not see a send. On the other hand, if you were to embed a “newline” character between lines instead of actually using the enter key, and then paste the entire content, there would be no delay because it would be treated as a single buffer/packet and not multiple smaller packets. Size (relative to MTU) matters.
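The send-immediately-versus-buffer behaviour can be poked at with a small loopback echo sketch (the port number is arbitrary, and on loopback the effect is tiny, but the same pattern applies over real ethernet):

```python
import socket
import threading
import time

def echo_once(srv):
    # accept one connection and echo everything until the peer closes
    conn, _ = srv.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)

def measure_rtt_ms(payload, nodelay, port=50510):
    # time one send/echo round trip, optionally with Nagle disabled
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(('127.0.0.1', port))
    srv.listen(1)
    threading.Thread(target=echo_once, args=(srv,), daemon=True).start()

    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if nodelay:
        # disable Nagle so small writes go out immediately
        cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    cli.connect(('127.0.0.1', port))
    start = time.perf_counter()
    cli.sendall(payload)
    echoed = cli.recv(4096)
    elapsed = (time.perf_counter() - start) * 1000
    cli.close()
    srv.close()
    return elapsed, len(echoed)
```

Running this with different payload sizes and the `nodelay` flag toggled gives a quick way to compare buffering effects without any external tooling.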

There may in fact be more than one kind of traffic going on, and a buffer might fill and trigger a send based on combinations filling up a buffer.

If you set a smaller MTU, then you will get lower latency before send, if the MTU filling up was the issue. However, if what you send fragments into several smaller packets instead of being sent as a single larger packet, then you waste bandwidth on per-packet overhead.
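As a rough back-of-the-envelope sketch of that overhead trade-off (ignoring TCP options, ACKs, and retransmits, and assuming plain 20-byte IPv4 and TCP headers):

```python
import math

IP_HDR = 20   # IPv4 header without options
TCP_HDR = 20  # TCP header without options

def segments_needed(payload_bytes, mtu=1500):
    """Return (segment count, total on-wire bytes) for a TCP payload."""
    mss = mtu - IP_HDR - TCP_HDR  # max payload per segment
    n = max(1, math.ceil(payload_bytes / mss))
    return n, payload_bytes + n * (IP_HDR + TCP_HDR)
```

With the default MTU of 1500, a 2048-byte message fits in 2 segments (80 bytes of header overhead); dropping the MTU to 512 makes the same message take 5 segments (200 bytes of overhead).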

On the receive side the driver will perhaps trigger based on multiple events: one being a scheduled interrupt on a periodic timer, another might be the NIC buffer filling and forcing the driver to process right then and there.

Intermediate hops may have their own polling or interrupt handling, and may not have data aligned for send/receive at the same moment your data sends/receives (it’s a big world).

If you have not set performance mode, then perhaps doing so would help:
sudo nvpmodel -m 0
sudo jetson_clocks

Otherwise, someone may have to reproduce this and profile where the specific latency is from. If someone else can reproduce this, then this is likely a driver/software setting issue. If not reproduced by someone else, then this may be some setting related to your specific network.

Thanks a lot for the detailed information.
Just to give some more insight into our setup: the communication path between the Windows PC and the Jetson consists of a dedicated port on the Windows PC (no traffic other than to the Nano) and a Gbit Ethernet switch. A direct connection without the switch did not change the behaviour. For all test cases (replaced Windows PC, replaced Jetson) this configuration was the same. The MTU size is set to 1500 on the Linux machine and TCP_NODELAY is enabled.
Interestingly, today I tested a cheap USB 3.0 Ethernet adapter connected to the Nano production device, and it did not(!) show the odd behaviour of varying response times (same eth settings used).

If you monitor “dmesg --follow” while plugging in a USB ethernet dongle, then something should show there to indicate whether the failure is a technical issue or whether it instead just needs a driver.

If you run “ifconfig eth0” (or name some other network interface if that is not correct), then you should see some statistics. Anything on the interface related to dropped, overruns, errors, so on, would be of interest.
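On Linux those counters can also be read programmatically from sysfs; a minimal sketch (the counter file names are the standard kernel statistics entries; the sysfs root is a parameter only so the function can be pointed at a test directory):

```python
from pathlib import Path

def read_if_stats(ifname,
                  keys=('rx_dropped', 'rx_errors', 'tx_dropped', 'tx_errors'),
                  sysfs=Path('/sys/class/net')):
    # read selected counters from /sys/class/net/<ifname>/statistics/
    stats_dir = sysfs / ifname / 'statistics'
    return {k: int((stats_dir / k).read_text()) for k in keys}
```

Polling this in a loop during a test run would show whether any of the error/drop counters move while the latency spikes occur.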

One of the problems with finding delays from ping is that you don’t necessarily know which end (or possibly both ends) of the connection is influencing this. On the other hand, do you have another Linux box you can ping with instead of Windows? Seeing the response from a Linux PC, and knowing if the problem follows the Jetson or instead it follows the Windows machine, would perhaps offer a direction to continue.

Instead of pinging the Jetson Nano production device, we measured the response time using an Ubuntu PC and a Jetson Nano developer device as clients. The issues with slow response times do not occur in these cases. The problem follows the Jetson production device, and seemingly even the built-in Realtek Ethernet controller: the USB 3.0 Ethernet adapter does not show the problem on the same machine.

After running some tests with sending and receiving data:

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet  netmask  broadcast
        inet6 fe80::204:4bff:fee9:c4f5  prefixlen 64  scopeid 0x20<link>
        ether 00:04:4b:e9:c4:f5  txqueuelen 1000  (Ethernet)
        RX packets 276933  bytes 42250906 (42.2 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 388468  bytes 278882228 (278.8 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 150  base 0xa000

Generated a small Python application which shows the mentioned problems with messages >= 70 bytes:

import socket
import time
import numpy as np

class MySocket():
    def __init__(self, sock=None):
        if sock is None:
            self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        else:
            self.sock = sock
        self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    def connect(self, host, port):
        self.sock.connect((host, port))
    def listen(self, host, port):
        self.sock.bind((host, port))
        self.sock.listen(1)
    def accept(self):
        (clientsocket, address) = self.sock.accept()
        return (clientsocket, address)
    def send(self, msg):
        # sendall() raises an exception on error, there is no return value to check
        self.sock.sendall(msg)
    def recv(self):
        # just assume the messages are small for the moment
        # and we receive just one chunk
        return self.sock.recv(4096)
    def ping(self, msg):
        self.send(msg)
        recv_data = self.recv()
        if len(msg) != len(recv_data):
            print(f'Package sizes do not match! {len(msg)} to {len(recv_data)}')

# Server app will send some messages with different sizes to the connected client
if __name__ == "__main__":
    myserver = MySocket()
    myserver.listen('', 50000)  # port number chosen arbitrarily
    while True:
        (clientsocket, address) = myserver.accept()
        myclientSocket = MySocket(clientsocket)
        # build up a list of messages with different sizes
        msg_list = []
        for k in range(4, 10):
            msg = 'p' * (2 ** k)
            msg_list.append(msg.encode())
        # add a couple of messages with size in the problematic range
        for k in range(68, 73):
            msg = 'p' * k
            msg_list.append(msg.encode())
        # in the productive environment we will probably send a message every
        # 0.2 to 1.0 s to the client; to simulate this, sleep a little bit
        # before each send
        sleep_time = 0.2
        # send each message a couple of times and measure the response time
        for msg_b in msg_list:
            timings = []
            for l in range(0, 10):
                time.sleep(sleep_time)
                start_time = time.perf_counter()
                myclientSocket.ping(msg_b)
                timings.append((time.perf_counter() - start_time) * 1000)
            print(f'size:{len(msg_b)} min: {min(timings):.2f}ms max: {max(timings):.2f}ms median: {np.median(timings):.2f}ms mean: {np.mean(timings):.2f}ms')

# Client app (separate script) will receive a message and send it back immediately
if __name__ == "__main__":
    myclient = MySocket()
    myclient.connect('192.168.0.2', 50000)  # server address/port are placeholders
    while True:
        msg = myclient.recv()
        if not msg:
            break
        myclient.send(msg)

This is good information. One thing which still is not clear: is the Python program used in all cases where the issue shows? I am wondering if we need to distinguish between what might be Python latency versus a network driver issue. The actual ifconfig output shows no errors, so networking, even if the driver is at fault for the latency, still has basically correct function.

One thing I am considering is that the message size is not the same as the packet size, since there is an 8-byte ICMP header. The default size of 56, when combined with the header, ends up being a 64-byte ping. I don’t know if this is related, but I suspect that there is some inefficiency with the non-default sizes. What happens when you don’t use the Python program, but instead use the command line? Example (I named the IP address you have, but adjust it for the outside computer, and then try again in reverse from outside to the Jetson…this is a 100-ping count just for a round number to look at):
ping -s 64 -c 100


72 bytes from icmp_seq=1 ttl=64 time=0.381 ms
72 bytes from icmp_seq=2 ttl=64 time=0.491 ms
72 bytes from icmp_seq=3 ttl=64 time=0.371 ms
72 bytes from icmp_seq=4 ttl=64 time=0.469 ms
72 bytes from icmp_seq=5 ttl=64 time=0.459 ms
72 bytes from icmp_seq=6 ttl=64 time=0.308 ms
72 bytes from icmp_seq=7 ttl=64 time=0.483 ms
72 bytes from icmp_seq=8 ttl=64 time=0.405 ms
72 bytes from icmp_seq=9 ttl=64 time=0.378 ms
72 bytes from icmp_seq=10 ttl=64 time=0.418 ms

ping -s 82 -c 100 gives:

90 bytes from icmp_seq=1 ttl=64 time=109 ms
90 bytes from icmp_seq=2 ttl=64 time=305 ms
90 bytes from icmp_seq=3 ttl=64 time=0.401 ms
90 bytes from icmp_seq=4 ttl=64 time=295 ms
90 bytes from icmp_seq=5 ttl=64 time=131 ms
90 bytes from icmp_seq=6 ttl=64 time=294 ms
90 bytes from icmp_seq=7 ttl=64 time=0.473 ms
90 bytes from icmp_seq=8 ttl=64 time=263 ms
90 bytes from icmp_seq=9 ttl=64 time=130 ms
90 bytes from icmp_seq=10 ttl=64 time=260 ms
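For reference, the relation between the -s payload and the sizes shown in the ping output can be sketched as (assuming IPv4 without IP options):

```python
ICMP_HDR = 8   # ICMP echo header
IPV4_HDR = 20  # IPv4 header without options

def ping_sizes(s):
    """For 'ping -s <s>': (bytes reported per reply line, total IPv4 packet size)."""
    reported = s + ICMP_HDR        # ping prints payload + ICMP header
    on_wire = reported + IPV4_HDR  # plus the IP header on the wire
    return reported, on_wire
```

So the default -s 56 yields the familiar 64-byte ICMP message, -s 64 shows up as “72 bytes from …”, and -s 82 as “90 bytes from …”, matching the outputs above.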

The part which stands out most to me is that in the “good” case the pings have a fairly narrow standard deviation. In the case where there is an issue, the variation swings rather far…but at times is just as fast as the “good” case.

Before running the “bad” case, do this on the Jetson and see if it helps:

sudo nvpmodel -m 0
sudo jetson_clocks

(you’ll probably want to reboot after the test because this forces it into max performance)

If latency remains high and/or high standard deviation when max performance, then I’d have to conclude something is causing the driver to delay. If max performance mode helps, then it may just be some sort of power savings mode getting in the way.

Did not help. Same high deviation in latency.
Interestingly, reducing the wait time before sending the messages in the Python script, e.g. to a minimal value like 0.001, makes the high latency suddenly disappear…

ping -s 82 -c 100 -i 0.2 still gives a high max latency, but a smaller std deviation:

90 bytes from icmp_seq=1 ttl=64 time=113 ms
90 bytes from icmp_seq=2 ttl=64 time=0.415 ms
90 bytes from icmp_seq=3 ttl=64 time=0.466 ms
90 bytes from icmp_seq=4 ttl=64 time=201 ms
90 bytes from icmp_seq=5 ttl=64 time=0.558 ms
90 bytes from icmp_seq=6 ttl=64 time=201 ms
90 bytes from icmp_seq=7 ttl=64 time=0.450 ms
90 bytes from icmp_seq=8 ttl=64 time=201 ms
90 bytes from icmp_seq=9 ttl=64 time=0.470 ms
90 bytes from icmp_seq=10 ttl=64 time=0.552 ms
90 bytes from icmp_seq=11 ttl=64 time=103 ms
90 bytes from icmp_seq=12 ttl=64 time=0.483 ms
90 bytes from icmp_seq=13 ttl=64 time=0.431 ms
90 bytes from icmp_seq=14 ttl=64 time=0.476 ms
90 bytes from icmp_seq=15 ttl=64 time=0.442 ms
90 bytes from icmp_seq=16 ttl=64 time=201 ms
90 bytes from icmp_seq=17 ttl=64 time=0.473 ms
90 bytes from icmp_seq=18 ttl=64 time=203 ms
90 bytes from icmp_seq=19 ttl=64 time=1.03 ms

100 packets transmitted, 100 received, 0% packet loss, time 19900ms
rtt min/avg/max/mdev = 0.388/48.239/202.366/71.980 ms, pipe 2

with sudo ping -s 82 -c 100 -i 0.1

100 packets transmitted, 100 received, 0% packet loss, time 10014ms
rtt min/avg/max/mdev = 0.376/22.094/101.734/38.323 ms, pipe 2

with sudo ping -s 82 -c 100 -i 0.01

100 packets transmitted, 100 received, 0% packet loss, time 1092ms
rtt min/avg/max/mdev = 0.377/0.792/11.186/1.815 ms, pipe 2

Very odd: the max latency for these cases is around the ping interval time.

Windows? No wonder your latency is terrible. Windows doesn’t guarantee any kind of low latency. Turn off all the crap: virus scanners, unneeded background services, etc. Raise the ethernet priority, set a CPU affinity, etc.

On the nano, install the real-time kernel extension and tweak the settings.

Windows is already excluded as the source of the problems. The same issues exist if the ping is sent by an Ubuntu PC to the Jetson device.

And? Did you not read further down?

If you suspect the nano, like I posted, install the real-time kernel extension and tune the network.

This is an indication that the driver is waiting for more data before sending. Typically this is a setting to make average bandwidth more efficient…fewer packets for a given amount of data implies less overhead from packet headers (so wait for more data to fill before sending, and only send with less than full if a timer has expired). When the data waits to fill up the buffer, then so does the wait before the send starts. This is probably something which could be tuned, but whether or not it makes sense to bother depends on the actual final data and situation.

I am going to suggest that this may not actually be an issue, but that if you get a specific condition where you have issues, then that issue might be tuned or adjusted for. Someone could investigate the high standard deviation for ICMP of those packet sizes, but I have doubts that this would actually change real world performance unless the real world case is exactly the same as the ICMP ping with that packet size.

If you test with a real world situation you are having problems with, then I am sure there would be a practical way to examine and improve the situation, but I think this particular issue is just the design for “general case” not being as efficient for that particular ICMP size.

So there is some sort of timeout in the driver which will send out the (small) message if no new data was received during a specific time?
The real-world application will send small messages of around 100 bytes every 100 to 500 ms to the Jetson Nano, so latency is much more important than bandwidth for us. I hoped that disabling the Nagle algorithm would help, but it did not.

Do you have concrete ideas/configs in mind which could “tune the network” for low latency application?

Correct. Filling up a buffer triggers a send. Details depend on a lot of things, e.g., protocol. YMMV.

You will have to check specific situations to know if this will affect you. Nagle algorithm won’t have any effect on this in most cases, although a packet resend counts towards filling a buffer. The Nagle timer is independent of the ethernet driver/stack, and takes place for confirmation issues rather than for original send issues.

All tuning depends strictly on the specific data, protocols, and so on. You might be able to set the MTU to a small value, e.g., 512 bytes (I don’t remember the minimum MTU), and make it so that a smaller amount of data in a buffer results in a send trigger. You could also append NULL bytes or byte patterns to messages to make sure the message size plus pattern plus header equals the MTU.

Changing MTU will be the same on a Jetson as it is on desktop Ubuntu, so any docs on the topic should work. Do keep in mind that your application will not be using ICMP (unless you are doing something bizarre).

Did some additional tests with increasing packet sizes. The latency for messages bigger than 4432 bytes is in the normal range:

ping -s 4432 -c 10
--- ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9013ms
rtt min/avg/max/mdev = 38.541/101.733/175.693/48.736 ms


 ping -s 4433 -c 10
--- ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9199ms
rtt min/avg/max/mdev = 0.445/0.494/0.515/0.018 ms

I will check whether tuning the MTU or maybe the buffer sizes on server and client has an impact on the latency. As you suggested, a fix would be to pad the messages to ensure a size >= 4433 bytes.
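If padding turns out to be the workaround, a hedged sketch of one way to do it (the 4433-byte threshold is the empirical value from the ICMP tests above and may well differ for TCP; the 4-byte length prefix is my own addition so the receiver can strip the padding again):

```python
import struct

PAD_THRESHOLD = 4433  # empirical value from the ping tests; may differ for TCP

def pad_message(payload: bytes) -> bytes:
    # prefix the real length, then pad with NUL bytes up to the threshold
    framed = struct.pack('>I', len(payload)) + payload
    if len(framed) < PAD_THRESHOLD:
        framed += b'\x00' * (PAD_THRESHOLD - len(framed))
    return framed

def unpad_message(data: bytes) -> bytes:
    # recover the original payload using the 4-byte length prefix
    (n,) = struct.unpack('>I', data[:4])
    return data[4:4 + n]
```

The obvious cost is bandwidth: every sub-threshold message balloons to 4433 bytes on the wire, which only makes sense because latency matters far more than throughput here.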