Serial Latency Causing Timeouts Across System Under High Load

We are using an Orin AGX with multiple ethernet cameras along with a few serial devices. Under high load, we are seeing serial delays that result in multiple slowdowns on our system.

These threads detail a similar issue along with an outdated fix to the problem we have seen.

[PATCH v3 1/2] drivers/tty: refactor functions for flushing/queuing work — Linux Serial

[PATCH v3 2/2] drivers/tty: use a kthread_worker for low-latency — Linux Serial

Is this a known issue that has a patch?

We are running our Orin AGX with L4T 35.3.1 using a ConnectTech Rogue carrier board.

*** Please note that this reply is generated by LLM automatically ***

The issue you’re experiencing with serial delays on your Jetson AGX Orin system, particularly under high load, is a known problem. The threads you mentioned detail a similar issue and propose outdated fixes. However, I’ll provide a comprehensive overview of the problem and potential solutions.

Problem Context

The Jetson AGX Orin series, like other embedded systems, uses a serial console for communication with serial devices. Under high load conditions, the serial console can become a bottleneck, leading to delays and slowdowns in the system. This issue is exacerbated when multiple serial devices are connected to the system.

Similar Issues and Outdated Fixes

The threads you mentioned discuss similar issues with serial delays on Linux systems. The proposed fixes involve refactoring functions for flushing and queuing work (PATCH v3 1/2) and using a kthread worker for low-latency serial communication (PATCH v3 2/2). However, these fixes are outdated and might not be applicable to your specific system configuration.

Potential Solutions

To address the serial delays on your Jetson AGX Orin system, consider the following potential solutions:

  1. Update to the latest L4T version: Ensure you’re running the latest version of L4T (Linux for Tegra) on your Jetson AGX Orin. NVIDIA regularly releases updates that might include fixes for serial communication issues.
  2. Configure serial console settings: Verify that your serial console settings are optimized for your system’s workload. You can try adjusting the serial console baud rate, parity, and stop bits to see if it improves performance.
  3. Use a serial console with a higher priority: Assign a higher priority to the serial console process to ensure it receives sufficient CPU resources, even under high load conditions.
  4. Implement a serial console buffer: Consider implementing a serial console buffer to cache incoming serial data, reducing the likelihood of delays and slowdowns.
  5. Evaluate alternative serial console implementations: Research alternative serial console implementations, such as the tty driver, which might offer better performance and lower latency.
  6. Consult with NVIDIA and ConnectTech: Reach out to NVIDIA and ConnectTech (the manufacturer of your carrier board) for guidance on resolving serial delays on your specific system configuration. They may be able to provide additional insights or recommend custom solutions.

Comparison with Other Jetson Modules

While the Jetson AGX Orin series is a powerful module, its serial communication capabilities might differ from other Jetson modules. For example, the Jetson Xavier NX and Jetson Nano modules have different serial console configurations and might not exhibit the same issues. If you’re experiencing similar problems on other Jetson modules, you can try comparing their serial console settings and configurations to identify potential differences.

In summary, while the issue you’re experiencing with serial delays on your Jetson AGX Orin system is a known problem, there are potential solutions to explore. By updating to the latest L4T version, configuring serial console settings, and implementing alternative solutions, you may be able to mitigate the delays and slowdowns. If you’re still experiencing issues, consult with NVIDIA and ConnectTech for further guidance.

*** Please give the thumbs up if you get answers you like. Or provide feedback to help us improve the answer. ***

Perhaps not the best solution, but if you have CTS/RTS flow control available, then you could enable that to help (not all serial adapters have the CTS/RTS pins; of those that do, sometimes the device itself won’t support CTS/RTS custom settings). This wouldn’t increase speed, but it would help with lost data and timeouts.
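If the hardware does turn out to support it, a minimal sketch of enabling CTS/RTS from user space with the standard termios interface might look like the following (the port name is an assumption; this only helps if the adapter and the remote device both wire up and honor the CTS/RTS lines):

import os
import termios

PORT = "/dev/ttyTHS0"  # placeholder; use whichever UART you actually open

# O_NONBLOCK avoids hanging on modem-control lines while we just adjust settings.
fd = os.open(PORT, os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)
attrs = termios.tcgetattr(fd)      # [iflag, oflag, cflag, lflag, ispeed, ospeed, cc]
attrs[2] |= termios.CRTSCTS        # enable hardware (CTS/RTS) flow control in cflag
termios.tcsetattr(fd, termios.TCSANOW, attrs)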

Hi,
Certain issues are discovered and listed in
Making sure you're not a bot!

This it not listed. Would suggest try latest Jetpack 5.1.5 or 6.2.1 and see if the issue persists.

@linuxdev @DaneLLL
Thanks for the replies! I can check if CTS/RTS flow control is available, although I don’t believe it is.

Unfortunately, updating our Jetpack version isn’t possible within our time constraints, but it is something we will consider for future development work.

I think we are still concerned that the serial delays are related to the thread priority being set low for the kworker threads handling our serial I/O. On our system, when we utilize real-time priority for our threads, we still see huge delays of greater than 250 ms while the system is under heavy load. Are there any ways that we can raise the priority of the kthreads we are utilizing?

A few notes:

  1. We are running our main application using threads with SCHED_FIFO set to a high priority (a minimal sketch of such a call is shown after this list)
  2. We can see in top -H that the threads of our process have a high priority number (example screenshot below, sorted by priority)
  3. We had previously applied the fixes from the two links Mylo posted above (re-compiled kernel with those fixes) in L4T 32.6.1 and were able to get low-latency serial working. In L4T 35.3.1 it appears the kernel is different, and we no longer know how to easily apply those fixes.
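For reference to point 1, here is a minimal sketch of how a thread can request SCHED_FIFO (the priority value is a placeholder; our real application sets its own). The same call, given an explicit PID/TID from top -H and root privileges, is also one way to experiment with raising a kworker’s priority, though that would be a workaround rather than a fix:

import os

PRIORITY = 50  # placeholder value, not what our application actually uses
# pid 0 means "the calling thread" on Linux; pass a TID from `top -H` instead
# (as root) to experiment with an existing kernel worker thread.
os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(PRIORITY))
print(os.sched_getscheduler(0))  # 1 == SCHED_FIFO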

Screenshot showing priorities of our main App:

A couple of questions:

  1. Anyone know if this has been fixed/improved in newer L4T releases?
  2. Could this also affect TCP data incoming to the system as well? We are also potentially seeing some latent TCP data and are wondering if the same cause is at play.

Hi,
There are two known issues which may be related:

  1. GUI HDMI setting affects XFI performance
  2. An issue about UART DMA
    [PATCH 1/2] dmaengine: tegra: Fix residue when xfer count = bytes_req
    [PATCH 2/2] tty/serial: serial-tegra: DMA improvements

The two issues are fixed in later releases, so we would suggest upgrading to the latest release and trying it.

Hi DaneLLL,

Thanks for the response! Do you know the version these issues were fixed in?

  1. We aren’t using HDMI at all, could this still be affecting us?
  2. Are we able to backport those patches to L4T 35.3.1?

Thanks again!

And to be clear, the main problem we are experiencing is latency. We don’t see any other issues such as dropped data or application crashes.

Thanks again,

Hi,
We strongly recommend upgrading to Jetpack 5.1.5 or 6.2.1 and running your use case.

If you prefer to manually merge the patches, you may sync the kernel source code of r35.6.2 to get the patches:
Kernel Customization — NVIDIA Jetson Linux Developer Guide documentation

I probably can’t answer this in a practical way, but do you create the kthread? Or is this a preexisting kthread?

Also, if it is a kthread, then it could probably be scheduled on its own core. In the case of true realtime there would be no cache involved in order to achieve deterministic timing. However, I don’t know what your actual implementation would do since Jetson CPUs are not Cortex-R (except for the hidden Audio Processing Engine and Image Signal Processor). You could certainly test that for a kthread.
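For completeness, setting the affinity of an existing task (including a kthread, using the PID/TID shown in top -H) can be done with the standard affinity call; whether the scheduler honors it in practice depends on the hardware IRQ routing described below. A minimal sketch with placeholder values:

import os

TARGET_PID = 1234  # placeholder: a thread ID taken from `top -H` or /proc
CORE = 3           # placeholder: the core you want that task restricted to

os.sched_setaffinity(TARGET_PID, {CORE})   # requires appropriate privileges
print(os.sched_getaffinity(TARGET_PID))    # confirm the new CPU mask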

That said, sometimes a kthread begins its life because a hardware IRQ starts it, or the kthread might need to wait on a result in order to feed a hardware device’s hardware IRQ. Probably it is just a kthread, but there might be more involved if you don’t know where the kthread is started or what it feeds.

Hardware IRQs tend to all run only on CPU0 (not all, but most). CPU0 will have a large impact on anything which requires a lot of hardware IRQs. You can’t really know for your specific case without some detailed profiling.

How to change that kthread might depend on whether you’ve added the RT kernel, and whether your own driver for some hardware or kthread is something you have control over or if it is from some existing kernel code. It’s really difficult to say.

Can you provide a more detailed list of the flow of your software and that kthread?

Hi linuxdev!

We aren’t spawning any kthread ourselves. We are using L4T 35.3.1 without any kernel or driver modifications.

We simply open the serial port via open() or open a socket via socket().

We are then reading serial/TCP data via read() or recv().

In both cases, during situations of high CPU load we see latency, where both read/recv have pauses of ~100ms-1000ms, even though we know the other side is sending data at high rate (200Hz).
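For reference, a minimal sketch of the measurement loop described above for the serial case (the port name, read size, and the 50 ms reporting threshold are assumptions; the port is assumed to already be configured):

import os
import time

fd = os.open("/dev/ttyTHS0", os.O_RDONLY | os.O_NOCTTY)
last = time.monotonic()
while True:
    data = os.read(fd, 4096)     # blocks until the driver delivers data
    now = time.monotonic()
    gap_ms = (now - last) * 1000.0
    last = now
    if gap_ms > 50.0:            # sender runs at 200 Hz, so ~5 ms is expected
        print(f"gap of {gap_ms:.1f} ms ({len(data)} bytes)")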

We are hoping there are fixes similar to these that we could apply in the newer L4T 35.3.1 kernel:
[PATCH v3 1/2] drivers/tty: refactor functions for flushing/queuing work — Linux Serial
[PATCH v3 2/2] drivers/tty: use a kthread_worker for low-latency — Linux Serial

Thanks again for the assistance

What serial port do you open with “open”? What socket do you open via “socket”? It looks like a network device instead of a serial UART, and so everything changes. For example, it might be data starvation or waiting to send based on networking and not the kthread (a kthread might wait for something and be what shows up, but why it is pausing may be unrelated to the priority of the kthread itself). If there is such an issue with some other bottleneck and it just shows up at the kthread, then there is no fix for the kthread other than fixing the bottleneck. The details I’m asking for above might offer more clues.

Hi linuxdev,

These are two separate devices we are seeing latency with. Not sure if the root cause is the same, but we see latency on both when under high CPU load.

For serial we are opening /dev/ttyTHS0

For socket, we are connecting to a remote TCP server at some <IP_ADDRESS:PORT>

Thanks again!

Question for @DaneLLL related to this: Can the xhci-hcd (GICv3) have its affinity changed from CPU0 to another CPU core? If so, then latency related to the hardware IRQs colliding on CPU0 could be taken out of the equation. So far as I know though this cannot change cores, and the scheduler would just reschedule back to CPU0. Don’t know.

Btw, you should never adjust priorities of a process to more than something around -4 (-5 would be more in this case). You’re going to end up with a priority inversion somewhere at times and it is actually going to occasionally do the opposite of what you want. Knowing a chain of dependencies and adjusting each part a small amount is more likely to do what you want.
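A minimal way to apply that kind of modest boost (negative nice values require root; the -4 here is just the rule of thumb from above, not a magic number):

import os

os.setpriority(os.PRIO_PROCESS, os.getpid(), -4)   # modest boost only
print(os.getpriority(os.PRIO_PROCESS, os.getpid()))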

Is it possible the latency is dependent upon the communications with the TCP remote server? If that is the case, then it is possible that latency increases when network time delays increase. This in turn could be the TCP/IP stack on the Jetson and not just the remote end. I’m not sure how to test this, but what kind of data rate and data size is on this network traffic? What happens if that network traffic is delayed? Is there dummy data you could use in its place after creating a local loopback socket to address 127.0.0.1? I ask this because going through the Ethernet PHY would involve hardware which uses CPU0 and would load this down, whereas a purely software socket would use kthread soft IRQs. The soft IRQ can be set for affinity on another CPU core, and the priority dramatically increased without issues. A small program which echoes or uses test data which repeats, if your program can work with that, could be compared when going over Ethernet versus a local socket (you’d use the same data at each end).

Is there particular data you can use as repeating dummy data? If you can find a set of data which you can repeat which the problem shows up under (even if it is many MB of data), then you can actually test. If you can isolate to a smaller test dummy data, then it gets closer. This is in fact something NVIDIA might be able to use to recreate this, but it is a dual purpose test to offload to a kthread on a different core and to repeat data to narrow down on it.

If this were just user space I’d be tempted to let it run and profile it with gprof, but when you start getting kernel latency, that’s out of the picture (mostly, there are cases though where gprof of user space would tell you which system call is related; the system call is specific code in the kernel we could look at).

Obviously there are differences between the L4T R35.x and R36.x kernels, but that doesn’t do much good if you can’t see where the issue is.

Hi,
On Jetpack 6, there are patches for shifting interrupts of the PCIe interface to another CPU core.

On Jetpack 5.1.5, the commits are included in K5.10.

Thanks linuxdev and DaneLLL

I think we are converging on the same answers. A few follow up questions/comments

  • @linuxdev Where did you get the max priority of -4 from? Does that number correspond to something special?
  • @linuxdev We are doing exactly as you suggested. We have a server that is sending dummy data at 200hz. We then run stress-ng on our system and see latency start to spike up to high values (example plot below showing latency in ms, where stress-ng was run for 30s in the middle of the test).
  • We have experimented with changing CPU affinity and shuffling things around to be pinned on various cores, but still see problems. Our theory is that it is the kworker threads themselves, which pull data out of buffers, that are getting choked out by higher-priority threads. We don’t have any control, that we know of, over which cores the kworker threads run on.
  • @DaneLLL Thanks for the links, although that doesn’t seem applicable here? We are running L4T 35.3.1.

Thanks again for looking into this with us

Hi,
Certain issues are fixed in later releases; we would suggest testing your use case on a developer kit with Jetpack 5.1.5 or 6.2.1.

The max priority of -4 is mainly just from experience. If you have system processes which are at a lower priority than your -4, then your priority will tend to run first (or in hard real time will run first). You’re not really supposed to be competing with system processes, but instead competing with user space processes. Let’s say for example that your process uses disk access, but your process has a higher priority than disk access; then all of a sudden you instead slow down due to a priority inversion (and the prerequisite process will wait until it is very late…e.g., a timeout…or until something unblocks it). Even if your process does not directly block the system in an outright priority inversion, it might do so for a library call which in turn gets blocked, or for some other kthread which is a side effect of running your process at a higher priority. I’ve found that there is rarely a problem at -4, and you can end up changing latencies and averages too much when you arbitrarily go to too high of a priority. Also, the scheduler can sometimes learn based on process pressure depending on the scheduler, so it is a bit of an art and not entirely science. I also have not found that a priority beyond -4 actually helps much…if it isn’t about competition, and is just a case of inefficient code, then you are just making the inefficiency run more often, which won’t fix it.

Something I haven’t looked at is whether networking is polled or on a hardware interrupt, so don’t take the following details with too much guarantee, but there is still an idea behind it which is very likely relevant and correct. When it comes to kthreads, some of them will typically poll on a timer at 1000 Hz (some might be triggered at a different rate by some other process or thread; the scheduler has the final authority on what runs). Hardware itself tends to be serviced from a hardware IRQ (typically bound to CPU0; you can get hardware IRQ starvation, and thus delays, when flooded with hard IRQs). Suppose you get a priority inversion with networking; you could end up making networking time out or wait longer because of another process of higher priority.

Networking itself has some fascinating possibilities. Please tell me what network protocols are used for this, e.g., is it TCP, UDP, multicast, etc.? Networking tries to be efficient or “nice” to upstream and/or downstream nodes. In most cases you end up with some buffer with a chunk of data, and the chunk is sent in a burst. This can be the actual payload, or it can be a fragment of a payload which the other end will try to keep in order and reassemble before passing on to your user space program (or to most of the kernel space). TCP tries to help with this, but I’m guessing that if this is on a local network it might be UDP. The gist of it though is that if a buffer being sent does not reach a certain size, then at some polling rate the network send will wait in the hope of getting more data before sending, reducing fragmentation and the overhead of too many small packets. There would be a delay. Packets which are too large for the frame are fragmented into subsets of the data, and those subsets which fill the buffer are sent immediately without waiting. The final buffer of fragments for data needing fragmentation can still wait, just like the individual frames which are too small and which wait for a timer (they would have been sent if more data arrived to fill the buffer, but it is the last buffer of the fragments and might not fill the buffer to its exact size).
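On the sending side, the small-packet coalescing behavior described here is commonly the work of Nagle’s algorithm (the thread does not confirm that this is what is happening, so treat it as an assumption to test). One quick experiment is to disable it on the sender’s socket and see whether the latency profile changes:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle so small writes go out immediately instead of being held back
# in the hope of coalescing them into a fuller segment.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
sock.connect(("192.0.2.10", 11122))  # placeholder address; port matches the test server later in the thread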

Keep in mind that if you add something of higher priority, then that wait timeout for sending small packets can increase.

In TCP/IP there is a rather fascinating set of possibilities, although it depends a lot on your actual data. We would need to know far more about data size and network settings. The MTU (maximum transmission unit) and MRU (maximum receive unit) work together, but MTU is the one which might wait to send a small amount of data. There are lots of ways to look at network metrics, and unless we are talking about your specific data with your specific protocol it may not make sense to go into much detail, but consider checking MTU and queue length with “ip link show <optionally name the specific interface>”. Or perhaps look at the fragmentation count using “netstat -s” before and after one of your tests (you’re interested in seeing how fragmentation changed during the test since it is a count of fragmented packets).

  • ip link show (compare MTU minus overhead to your data send size).
  • netstat -s (look for increases in fragmentation after running your test for a few minutes when latency has gone up).

A perfect solution is when your data is exactly the size the buffer wants to be filled to, with any overhead accounted for. In turn, that size works best if it is exactly the size used by the next hop in the route. Sometimes adding unused NULL bytes to the end of data to fit that size, and then dropping those NULL bytes at the other end, is faster than sending less data, since less data won’t cause an immediate send (more data can produce less latency).
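As a toy illustration of that padding idea (the frame size here is an arbitrary assumption; the right value depends on the MTU and protocol overhead discussed below):

FRAME_SIZE = 1460  # placeholder payload size meant to fill one segment

def pad(msg: bytes) -> bytes:
    # Pad with NUL bytes so a full buffer is sent immediately rather than
    # waiting in the hope of coalescing more data.
    return msg.ljust(FRAME_SIZE, b"\x00")

def unpad(frame: bytes) -> bytes:
    # Toy inverse; assumes real messages never end in NUL bytes.
    return frame.rstrip(b"\x00")

assert unpad(pad(b"12345\n")) == b"12345\n"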

If you were to use some old text-based application to send and receive typed in messages, e.g., the old IRC or some old text-based games from the 1990s, and if your MTU was 512 or some multiple, then you’d have more latency than if you were to drop your system’s MTU to 296 bytes (I’m looking for the size which includes overhead of 40 bytes since it is TCP; payloads tend to be even powers of 2, and so 256+40=296). Because the latency of data sent is more important in this case than average throughput it is better to use the smaller MTU in hopes of filling the buffer before a timeout send. More packets implies less efficiency but better latency results.

There is the case for sending larger frames of data. Fragmentation and reassembly has a lot of possibilities for problems. Data fragments may arrive out of order and some might need to wait for others to reassemble. Some fragment might be lost and either the entire data payload retransmitted or at least a fragment retransmitted, which is an enormous hit on latency and efficiency. If we are sending data in bursts of 20,000 bytes, then a 1500 byte MTU will get a lot of fragmentation and then reassembly at the other end. There will be lots of checksums involved in kernel space (some network hardware has no overhead to the Linux kernel when checksums are performed by the NIC itself). In that case one might be better off enabling jumbo frames with a 64k byte size, so the data is sent as a single packet. On the other hand, if the next hop in the route does not support jumbo frames, then you will just get the same fragmentation over the network route and jumbo frames might not really help at all. When you are using a private network with communications from host to switch to host, and no intervening hardware, then you can control this and use jumbo frames at all ends (assuming the switch supports that). You’d still possibly have slight latency added waiting to fill a 64k buffer, but it would be less than fragmenting several times and reassembling. Better yet, in this case, maybe use jumbo frames in combination with NULL byte padding to fill a frame and send it immediately rather than waiting to possibly add more to the buffer for efficiency.

What is “best” depends on so much that it is hard to fine tune without exact details of the networking. MTU, if reduced, will always reduce a packet size, and perhaps improve latency in some cases. MRU is up to the other end, and if MRU is violated, the packet might be completely discarded. Most of the time a node uses the same MRU as the MTU. Consider though that even if you enable jumbo frames, if the other end has an MRU of less than that, then the jumbo frame is going to be fragmented before sending despite jumbo frames being enabled. MRU and MTU work together and you will use the minimum size between the two.

Between hardware IRQs on networking and soft IRQs which might be involved in things like checksums and reassembly or fragmentation, it is really easy for another process with a higher priority to change things completely in how networking behaves, even if your particular process does not seem to be related; priority among user space processes isn’t such a problem, but you have to use the root user to set a priority higher than 0 (more negative “nice”, e.g., “-1”), and there is a reason for that. Once you get into those higher priorities you are competing with kernel space and not just user space.

If a priority of “-4” does not improve things, then it is likely something else needs to be considered and that competition for resources is not the cause of the latency.

Regarding affinity, keep in mind that normally it is the scheduler which determines what runs and when something runs on any given core. An RT kernel is not magic, and what really changes is the scheduler algorithm. The rules for what to send where can be made more absolute with an RT scheduler, but someone has to actually tune that for it to matter. A normal scheduler will succumb more to “pressure” from a low priority process which has been delayed longer and longer, such that it eventually runs even if it is a lower priority. RT can give hard assurances, but that is only in software; it doesn’t change the hardware unless we are talking about an ARM Cortex-R core.

Your normal ARM Cortex-A core (or a desktop CPU from Intel or AMD) has cache, probably at multiple levels. Your scheduler is aware of this. Sometimes a process will have many threads, and if you have 8 or 12 or more cores, it might look like it would run faster by putting each thread on a different core, but this is rarely the actual case; it depends on data. Every time you migrate from one core to the next you will get a cache miss. Any time you stay on the same core you might (probably in a lot of cases) get a cache hit. Cache misses cost a lot of time. Typically the scheduler will try to run threads of a process on a single core to try and get cache hits. If you know this is not a problem, then putting a process or thread on a new core might help. That’s only true though if the scheduler is not forced for some reason to migrate back to the original core.

CPU0 is a special core on Jetsons. This core has the wiring for the IRQ of any hardware (hardware IRQs need wiring to a core to send an interrupt to that core; desktop PCs have either an IO-APIC or an equivalent programmable interrupt controller to change where a hard IRQ routes to). There is a lot of hardware which has no routing to any other core; you could set affinity to one of those other cores, but if a hard IRQ is observed, then the scheduler must migrate back to that original core.

The file “/proc/interrupts” is a list of current hardware IRQ statistics. It isn’t a real file; it lives in RAM and is updated in real time. You could look at it with something like “less /proc/interrupts” and see a snapshot. You will notice hard IRQs for timers on all cores. Every core always has a timer, and this allows polling on that core. However, you’ll notice an overwhelming number of hard IRQs on CPU0. To some extent this is also true on a desktop PC, but a PC with the IO-APIC (or equivalent) has many more tricks up its sleeve that a Jetson won’t have when it comes to direct hardware interaction. The desktop PC is also trying to maximize use of cache, so it isn’t entirely different; it is the same scheduler.
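A quick way to see how lopsided that is on a given system is to total the per-CPU columns of /proc/interrupts; a small sketch (it only counts the numbered hardware IRQ rows and skips summary rows such as ERR/MIS):

with open("/proc/interrupts") as f:
    cpus = f.readline().split()        # header row: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in f:
        fields = line.split()
        # Numbered hardware IRQ rows look like "<irq>: <count per CPU> ..."
        if not fields or not fields[0].rstrip(":").isdigit():
            continue
        for i, count in enumerate(fields[1:1 + len(cpus)]):
            if count.isdigit():
                totals[i] += int(count)

for cpu, total in zip(cpus, totals):
    print(f"{cpu}: {total}")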

If you’ve picked up the RT kernel, then you have a shiny new scheduler! If the hardware allows it, then the RT scheduler allows you to do things to more or less guarantee some events. On a Cortex-A core this is not going to be a guarantee because of cache hits and misses, but it does add some control by tuning priorities (this is not automatic). The PC makes better use of the RT extension when you have an IO-APIC or equivalent, as the scheduler won’t be forced to migrate back to the original core when the new core would otherwise lack hardware IRQ routing.


It is indeed a possibility that your processes pulling data out of a kworker thread are bottlenecking, but I would definitely not assume that a simple increase of priority in your program is the fix; there is an enormous chain of priorities. Is the kworker thread your thread, or is it part of the system? Solving this can differ depending on the answer to that. Profiling can give you better answers, but in kernel space that’s a far more difficult task than in user space. Looking at networking, do you have a way to tell whether the data being fed in is itself being delayed by latency?

Hi @linuxdev

Wow! Thanks for the detailed response. Going to have to take some time over the weekend to digest this one.

To answer a few of your questions:

  1. For this test we are running on an Orin AGX connected via a single Ethernet cable (no routers/switches) to another compute box. On the compute box we are running a simple TCP server program that just sends a counter at 200 Hz (see code below).
  2. In this program we include the counter so we can see if packets are being dropped.
  3. On the client side we simply calculate latency as the delta from the previous message (using CLOCK_MONOTONIC); a sketch of such a client is included after the server code below. When under light load, the latency we calculate is very reasonable (~5 ms as expected). When the system is under load, we see bursty/high latency with values upwards of ~1000 ms (see plot above). Again, this client program is running with high SCHED_FIFO priority.
  4. We are not spawning any of our own kworker threads. We can see in top -H that there are a bunch of kworker threads with priority 20, but we have no idea what they correspond to or what they are doing.

Thanks again for all the deep insight here. Very much appreciated.

#!/usr/bin/env python3
import socket
import time
import signal
import sys

HOST = "0.0.0.0"
PORT = 11122
PERIOD_S = 1.0 / 200  # 200 Hz

running = True

def handle_sigint(signum, frame):
    global running
    running = False
    print("\nStopping server...")

def serve_one(conn, addr, start_counter=0):
    """Serve a single client until it disconnects or Ctrl-C is pressed.
       Returns the last counter value sent (so we can continue after reconnect)."""
    print(f"Client connected from {addr}")
    conn.setblocking(True)
    counter = start_counter
    next_time = time.perf_counter()

    try:
        while running:
            # Send just the counter for easy parsing by the client
            msg = f"{counter}\n".encode("utf-8")
            conn.sendall(msg)
            counter += 1

            next_time += PERIOD_S
            now = time.perf_counter()
            sleep_dur = next_time - now
            if sleep_dur > 0:
                time.sleep(sleep_dur)
            else:
                # If we ran late, resync schedule
                next_time = now
    except (BrokenPipeError, ConnectionResetError):
        print("Client disconnected.")
    finally:
        try:
            conn.shutdown(socket.SHUT_RDWR)
        except Exception:
            pass
        conn.close()
    return counter

def main():
    signal.signal(signal.SIGINT, handle_sigint)

    # Reuse last counter value across reconnects
    rolling_counter = 0

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind((HOST, PORT))
        s.listen(1)
        print(f"Server listening on {HOST}:{PORT}")

        while running:
            try:
                s.settimeout(1.0)
                conn, addr = s.accept()
            except socket.timeout:
                continue
            except OSError as e:
                if running:
                    print(f"Accept error: {e}")
                break

            rolling_counter = serve_one(conn, addr, rolling_counter)

    print("Server shut down.")

if __name__ == "__main__":
    main()
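
For reference, here is a client-side sketch of the latency measurement described above (the server address is a placeholder; the parsing and reporting format are assumptions rather than our exact client):

#!/usr/bin/env python3
import socket
import time

SERVER_IP = "192.0.2.10"   # placeholder for the compute box's address
PORT = 11122

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect((SERVER_IP, PORT))
    buf = b""
    last = time.monotonic()
    while True:
        chunk = s.recv(4096)
        if not chunk:
            break                      # server closed the connection
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            now = time.monotonic()
            gap_ms = (now - last) * 1000.0
            last = now
            # At 200 Hz we expect ~5 ms between counters; large gaps show the stalls.
            print(f"counter={line.decode()} gap={gap_ms:.1f} ms")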