Serial port less reliable after upgrade to 35.1, continuation

Hello, this is the continuation of the now-closed topic
https://forums.developer.nvidia.com/t/serial-port-less-reliable-after-upgrade-to-35-1/232396

I have set up the communication with two stop bits instead of one (on the Jetson side and on the sensor side) as @linuxdev suggested, so we are now in 460800 8N2 mode. Unfortunately I get the same amount of packet CRC errors as I had with 1 stop bit, on JetPack 5.0.2 / L4T 35.1.
The port does not matter: I tried ttyTHS0 and ttyTHS1 (with nvgetty disabled), and the amount of errors does not change significantly. Nothing of interest appears with dmesg | grep serial.
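
For reference, the port setup on the Jetson side looks roughly like this (a minimal sketch of the termios configuration, not the exact acquisition code; error handling is omitted):

// Sketch: open a Tegra UART at 460800 baud, 8 data bits, no parity, 2 stop bits (8N2).
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

int open_sensor_port(const char *dev)          // e.g. "/dev/ttyTHS0"
{
    int fd = open(dev, O_RDWR | O_NOCTTY);     // blocking mode
    termios tio{};
    tcgetattr(fd, &tio);
    cfmakeraw(&tio);                           // raw mode: no line-discipline processing
    cfsetispeed(&tio, B460800);
    cfsetospeed(&tio, B460800);
    tio.c_cflag &= ~(PARENB | CSIZE);          // no parity
    tio.c_cflag |= CS8 | CSTOPB;               // 8 data bits, 2 stop bits
    tio.c_cc[VMIN]  = 1;                       // read() blocks until at least 1 byte
    tio.c_cc[VTIME] = 0;
    tcsetattr(fd, TCSANOW, &tio);
    return fd;
}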

Completely reverting to JetPack 4.6.2 / L4T 32.7.2 works nicely with either 1 or 2 stop bits: there are no packet CRC errors for hours.

It seems some serial-related configuration has changed; how can I investigate further?
Thank you very much for your help!

During my investigation I noticed that calling jetson_clocks --show increased the packet error count.
EDIT: this is not directly related, it just increased the CPU load; see below.

hello alban.deruaz,

may I know which serial port you’re using? please look for kernel logs with… $ dmesg | grep THS
besides, please also share the device tree changes or code snippets you’ve made, for reference.
thanks

Hello JerryChang,

Thanks for your reply.
I am listening on ttyTHS0:

[    6.718686] 3100000.serial: ttyTHS0 at MMIO 0x3100000 (irq = 31, base_baud = 0) is a TEGRA_UART
[    6.735198] 3110000.serial: ttyTHS1 at MMIO 0x3110000 (irq = 32, base_baud = 0) is a TEGRA_UART
[    6.748436] 3140000.serial: ttyTHS4 at MMIO 0x3140000 (irq = 33, base_baud = 0) is a TEGRA_UART

The only change I made to the stock device tree is adding the following line to the serial@3100000 block:

nvidia,adjust-baud-rates = <0 1000000 100>;

This allows setting a baud rate higher than 115200 on this serial port.
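
In context, the change looks roughly like this (a sketch, not the complete node; everything except the added property is left as in the stock tree, and the three cells are a baud-rate range plus an adjustment value as read by the Tegra high-speed UART driver):

serial@3100000 {
        /* ... stock properties unchanged ... */

        /* Added: accept baud rates in the 0..1000000 range on this port. */
        nvidia,adjust-baud-rates = <0 1000000 100>;
        status = "okay";
};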

The acquisition process is a simple C++ program performing read calls in an infinite loop, which worked flawlessly on JetPack 4.6.2.
Thanks for your help!

I pushed the investigation further:
When the acquisition is running with nothing else (total CPU load around 1%), errors are near 0.
When other programs are running (total CPU load around 20%), errors start rising.
I tried moving the serial@3100000 interrupts to another CPU and raising the acquisition process priority with chrt -f 99, but it does not fix things.
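
Concretely, this is the kind of thing I tried (a sketch; use the IRQ number shown in the first column of /proc/interrupts for the 3100000.serial entry, and the real PID of the acquisition process):

# pin the UART interrupt to CPU2 (hex mask 0x4)
$ grep -i serial /proc/interrupts
$ echo 4 | sudo tee /proc/irq/<uart-irq>/smp_affinity

# give the already-running acquisition process SCHED_FIFO priority 99
$ sudo chrt -f -p 99 <acquisition-pid>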

In an Intel-format CPU you have hardware known as an I/O APIC (Advanced Programmable Interrupt Controller…AMD has something else in its hardware layout). That chip is used to program which CPU core a hardware interrupt is routed to. Jetsons (and many embedded systems) do not have this. Hardware interrupts can only be sent to a core where there is physical wiring to the core. Since only CPU0 is wired for most hardware interrupts, there is no ability to migrate to any other core (you can try, but it’ll migrate back to CPU0).

Take a look at the output of “cat /proc/interrupts” (which shows hardware IRQs). Look at the number of interrupts on each core, and note that CPU0 takes far more hardware IRQs than any other core, even if all CPUs are loaded near 100% (the counts update and increment each time you “cat /proc/interrupts”). Software IRQs can run on any core, and every core has access to the memory controller and its own timers, but you’ll find most hardware communications, e.g., ethernet, stick to CPU0. When you have more hardware interrupts than CPU0 can service you get “interrupt starvation”. Are you running the system at its max performance mode? If not, then you could speed up CPU0 by doing this.

Alternatively, there are software IRQs (indicated from ksoftirqd) or user space processes which could be migrated away from CPU0 to another CPU, and thus give CPU0 a lighter load for the hardware IRQs which cannot migrate.
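
As an illustration of that kind of migration (a sketch only; the process name is a placeholder), you can restrict user-space work to cores other than CPU0 so that CPU0 has more headroom for hardware IRQs:

# move an already-running process onto CPUs 1-5
$ sudo taskset -cp 1-5 <pid>

# or launch a CPU-heavy job already restricted to cores 1-5
$ taskset -c 1-5 ./my_heavy_job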

hello alban.deruaz,

please also refer to the Power Mode Controls section and change the power mode with the nvpmodel command (or use the nvpmodel GUI front end) to set system performance to MaxN (i.e. mode 0) for confirmation.

Hello, thank you very much for your help,

nvpmodel -q shows I am running in MODE_20W_6CORE.
Also, by running cat /proc/interrupts I have confirmed that the serial@3100000 interrupts have successfully moved to a CPU other than 0: the associated interrupt counter is increasing in the CPU2 column.

I don’t really know what more to test: the serial port is opened in blocking mode, I read it in a loop without sleeps, and the serial port’s DMA should be able to absorb any gap in reading latency… Is there maybe a kernel flag to get verbose information about serial port activity?
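
For clarity, the read side is essentially just this (a sketch; the packet/CRC handling function is a hypothetical stand-in for the real processing):

// Blocking read loop: with VMIN=1/VTIME=0, read() returns as soon as data arrives.
uint8_t buf[4096];
for (;;) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n <= 0)
        continue;                      // interrupted or transient error
    process_packets(buf, n);           // hypothetical packet framing + CRC check
}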

how about changing the power mode to MaxN, $ sudo /usr/sbin/nvpmodel -m 0, for testing?

Hello, I tried with MaxN; it doesn’t really change much regarding the amount of packet errors. I checked that I only had 2 CPUs left in /proc/interrupts, and tried moving the serial IRQ to the other CPU along with the acquisition thread, to no avail.

Can you post a copy of your “/etc/nvpmodel.conf”? Or browse that and examine the “POWER_MODEL ID=”.

Also, this is just subjective, but if you run htop (“sudo apt-get install htop”, or any similar program), and the system is under load, do all of the CPU cores show activity?

What do you see from “cat /proc/cmdline”?

Hello @linuxdev ,
Here is the information asked:

$ cat /etc/nvpmodel.conf | grep "PM_CONFIG DEFAULT"
< PM_CONFIG DEFAULT=8 >

$ cat /etc/nvpmodel.conf | grep "POWER_MODEL ID"
# < POWER_MODEL ID=id_num NAME=mode_name >
< POWER_MODEL ID=0 NAME=MODE_15W_2CORE >
< POWER_MODEL ID=1 NAME=MODE_15W_4CORE >
< POWER_MODEL ID=2 NAME=MODE_15W_6CORE >
< POWER_MODEL ID=3 NAME=MODE_10W_2CORE >
< POWER_MODEL ID=4 NAME=MODE_10W_4CORE >
< POWER_MODEL ID=5 NAME=MODE_10W_DESKTOP >
< POWER_MODEL ID=6 NAME=MODE_20W_2CORE >
< POWER_MODEL ID=7 NAME=MODE_20W_4CORE >
< POWER_MODEL ID=8 NAME=MODE_20W_6CORE >



$ cat /proc/cmdline
root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 console=ttyTCU0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 nv-auto-config

htop shows that all 6 CPUs are active; the mean load over 60 seconds is:
CPU 0 19.4%
CPU 1 20.4%
CPU 2 22.6%
CPU 3 18.1%
CPU 4 17.6%
CPU 5 23.4%

Thank you very much for your help!

So it looks like the power model ID needed for maximum performance has changed. Try:

sudo nvpmodel -m 8
sudo jetson_clocks

Does the port become more reliable when maxed out like that?

Hello @linuxdev ,
Before launching the program, I always perform a sudo jetson_clocks to max out performance.
nvpmodel -q already shows I am running in mode 8 (as you can see in nvpmodel.conf: PM_CONFIG DEFAULT=8)

I also tried not running jetson_clocks at all before launching the app, but it does not really change the serial port reliability.

I don’t know then what the cause of the reliability change between versions would be. Obviously the hardware itself is identical between the two releases, and the power mode is the same. “Sounds like software”, but one last question: Is the UART talking to the same external device? Could you test in loopback mode (TX and RX wired directly together with the shortest possible length) to find out if it has anything to do with the cable+remote end?

Hello @linuxdev ,
So I wired RX and TX together and wrote a simple C++ program to test the serial port.
First, I open /dev/ttyTHS0 at 460800 8N1 in non-blocking mode (with the serial port settings set to raw mode).
Then I create two threads:
The first thread sends the same 60 bytes at 200 Hz.
The second thread reads the data and checks that the received data matches the sent data.
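
Condensed, the test program looks roughly like this (a sketch of what it does, not the exact source; a real test would also need to resynchronize on the pattern after an error):

// Loopback test sketch: writer sends a fixed 60-byte pattern at 200 Hz,
// reader compares every received byte against the expected pattern position.
#include <array>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

int main()
{
    int fd = open("/dev/ttyTHS0", O_RDWR | O_NOCTTY | O_NONBLOCK);
    termios tio{};
    tcgetattr(fd, &tio);
    cfmakeraw(&tio);                              // raw mode
    cfsetispeed(&tio, B460800);
    cfsetospeed(&tio, B460800);
    tio.c_cflag &= ~(PARENB | CSTOPB | CSIZE);
    tio.c_cflag |= CS8;                           // 8N1
    tcsetattr(fd, TCSANOW, &tio);

    std::array<uint8_t, 60> pattern{};
    for (size_t i = 0; i < pattern.size(); ++i)
        pattern[i] = static_cast<uint8_t>(i);

    std::atomic<uint64_t> mismatches{0};

    std::thread writer([&] {                      // send the pattern every 5 ms (200 Hz)
        for (;;) {
            (void)write(fd, pattern.data(), pattern.size());
            std::this_thread::sleep_for(std::chrono::milliseconds(5));
        }
    });

    std::thread reader([&] {                      // compare incoming bytes to the pattern
        size_t pos = 0;
        uint8_t buf[256];
        for (;;) {
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n <= 0)
                continue;                         // non-blocking: nothing received yet
            for (ssize_t i = 0; i < n; ++i) {
                if (buf[i] != pattern[pos])
                    ++mismatches;                 // missing or corrupted character
                pos = (pos + 1) % pattern.size();
            }
        }
    });

    for (;;) {                                    // print statistics every 10 seconds
        std::this_thread::sleep_for(std::chrono::seconds(10));
        std::printf("mismatches so far: %llu\n",
                    static_cast<unsigned long long>(mismatches.load()));
    }
}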

When only this program is running, reception is near-perfect (I get 1 mismatch error every 5 minutes or so).
When I also run, in another terminal as root, watch -n 0.1 "setsid jetson_clocks --show", bringing total CPU load to about 5%, I get 1 to 5 mismatch errors per second. Sometimes a character is missing, sometimes a character is corrupted.

Thanks for your help!

a.out (61.2 KB)
Here is the sample acquisition program to reproduce the issue; all you need to do is:
Run the program in one terminal (making sure /dev/ttyTHS0 is accessible as your user, or run as root).
Acquisition stats are written to console every 10 seconds.

→ wait for a bit to see that acquisition is OK (0 crc errors)
Run as root in another terminal: watch -n 0.1 "setsid jetson_clocks --show"
→ Notice that packet crc errors start to rise.

If you run at 115200 8N1, does it have the same errors? I ask because the clock is known to have a correct value (within tolerance) at that speed, but at higher speeds the clock tolerance may be off.

Also, I don’t know if the threaded program might have anything to do with it. It shouldn’t, but it might, due to having the same CPU core’s IRQ used for both read and write. The number of interrupts would double, but there would be no guarantee of them being serviced in alternating order (there might be multiple write IRQs before a read IRQ, or the other way around; logically, you want exactly one write before one read if you are testing the UART and want to avoid effects from other parts of the system during the test).

The first version of the program was non-threaded: in an infinite loop it would publish the data if necessary, then read. It showed the same amount of errors.

I ran both programs on JetPack 4.6.2 for 10 minutes; no serial errors were detected, even under load.

I tried running at 115200: I get fewer errors (down to 1 CRC error every minute, which is a nice improvement). However, if I increase the CPU load by launching a total of 3 terminals running the watch command above, causing 15% CPU load, errors start to rise again.

Since this loopback test uses a single serial device, I believe RX and TX share the same clock, meaning any clock deviation would be common to both pins?

I tested on two different NX modules and obtained the same behaviour, so I guess anyone should be able to reproduce the issue. I can post the source code of the program if it helps.

Thank you very much for your help!

I can’t confirm it, but you are probably correct about the clock deviation not mattering in loopback. You mentioned the 200 Hz send rate. When sending at 200 Hz does it wait until receive has completed before sending more data? It does sound like a software error, but it isn’t one that will be easy to find, and anything known to trigger this might matter. We know it only matters on JetPack 5.x, but it is good to compare and verify each failure as not failing on 4.x as well (what you’ve listed is useful, keep doing the comparisons of 4.x and 5.x).

After you’ve run this for 10 minutes on both 4.x and 5.x, can you post a copy of “/proc/interrupts”? I just want to see if they are using the same core, and what the statistics might show.

I’m hoping you have both a 4.x and 5.x board rather than having to flash fresh each time. We could probably skip some of the 4.x steps if you have to reflash each time.

So yes, being able to reproduce it matters, including exact releases used (actual flashed content version is found via “head -n 1 /etc/nv_tegra_release”).