Serial port less reliable after upgrade to 35.1, continuation

Hello @linuxdev,

I have just performed the same tests on 5.0.2 and 4.6.2.
In the enclosed zip file SerialTest.zip (31.3 KB) are the test program source code and binary, the test script, and 4 folders containing results.
I am using the same setup for all tests: same NX and NX dev kit carrier board, with ttyTHS0 TX and RX wired together (loopback).
The machine is rebooted before each test.

I stripped the test program down to the bare minimum.
We initialize ttyTHS0 @ 460800 in blocking mode, then we launch a write thread and a read thread.
The write thread writes the same 60-character string at approximately 200 Hz: 0x1 0x2 0x2 … 0x2 0x3.
The read thread parses the serial data character by character: it waits for the expected START_BYTE character 0x1, then reads until it finds another START_BYTE or until 60 characters have been read. We then check whether the received string matches what we expect.
The main program also dumps some serial statistics every 10 seconds.
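
To make this concrete without downloading the zip, here is a stripped-down sketch of what the program does. This is illustrative only, not the exact source from SerialTest.zip; names, error handling and the statistics output differ in the real code.

```cpp
// Illustrative sketch of the test logic (not the exact SerialTest.zip source):
// open ttyTHS0 at 460800 baud in blocking raw mode, write a fixed 60-byte
// packet at ~200 Hz, and verify the looped-back data byte by byte.
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>
#include <atomic>
#include <cstdio>
#include <cstring>
#include <thread>

static constexpr unsigned char START_BYTE = 0x1;
static constexpr unsigned char STOP_BYTE  = 0x3;
static constexpr size_t PACKET_LEN = 60;

static int open_port(const char *dev)
{
    int fd = open(dev, O_RDWR | O_NOCTTY);        // blocking mode: no O_NONBLOCK
    if (fd < 0) return -1;
    termios tio{};
    tcgetattr(fd, &tio);
    cfmakeraw(&tio);                               // raw 8N1, no flow control
    cfsetispeed(&tio, B460800);
    cfsetospeed(&tio, B460800);
    tio.c_cc[VMIN]  = 1;                           // read() blocks until >= 1 byte
    tio.c_cc[VTIME] = 0;
    tcsetattr(fd, TCSANOW, &tio);
    return fd;
}

int main()
{
    int fd = open_port("/dev/ttyTHS0");
    if (fd < 0) { perror("open"); return 1; }

    std::atomic<long> packets{0}, errors{0};

    std::thread writer([fd] {
        unsigned char pkt[PACKET_LEN];
        pkt[0] = START_BYTE;
        memset(pkt + 1, 0x2, PACKET_LEN - 2);
        pkt[PACKET_LEN - 1] = STOP_BYTE;
        for (;;) {
            if (write(fd, pkt, sizeof(pkt)) < 0) perror("write");
            usleep(5000);                          // ~200 Hz
        }
    });

    std::thread reader([fd, &packets, &errors] {
        unsigned char buf[PACKET_LEN];
        size_t n = 0;
        for (;;) {
            unsigned char c;
            if (read(fd, &c, 1) != 1) continue;
            if (n == 0) {                          // wait for a START_BYTE
                if (c == START_BYTE) buf[n++] = c;
                continue;
            }
            if (c == START_BYTE) {                 // new start before 60 bytes: bad packet
                packets++; errors++;
                buf[0] = c; n = 1;
                continue;
            }
            buf[n++] = c;
            if (n == PACKET_LEN) {                 // got 60 bytes: verify the contents
                bool ok = (buf[PACKET_LEN - 1] == STOP_BYTE);
                for (size_t i = 1; ok && i + 1 < PACKET_LEN; ++i) ok = (buf[i] == 0x2);
                packets++;
                if (!ok) errors++;
                n = 0;
            }
        }
    });

    writer.detach();
    reader.detach();
    for (;;) {                                     // dump statistics every 10 seconds
        sleep(10);
        printf("%ld packets, %ld errors\n", packets.load(), errors.load());
    }
}
```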

To perform a test, the test.sh script does the following:
sudo jetson_clocks
sudo nvpmodel -m 8
Dump /etc/nv_tegra_release to the results folder
Launch the test program (redirecting its output to the folder), set SCHED_FIFO priority 99 on its PID with chrt, and move the ttyTHS0 interrupt to CPU 1
Wait 10 minutes
Dump /proc/interrupts to the folder

Tests on Jetpack 4.6.2:
462-idle: only the test script was launched
462-clocks: the test script was launched, and 10 seconds later we ran as root in another terminal: watch -n 0.1 “setsid jetson_clocks --show”

Tests on Jetpack 5.0.2:
502-idle: only the test script was launched
502-clocks: the test script was launched, and 10 seconds later we ran as root in another terminal: watch -n 0.1 “setsid jetson_clocks --show”

As the content of out.txt shows, we have 0 total errors in 462-idle, 462-clocks and 502-idle (“0 total errors”).
Regarding 502-clocks, we have a total of 388 packet errors (0.32% of all packets); the errors are distributed fairly evenly over the acquisition.

I also tried, in 462-clocks mode, launching 6 terminals instead of 1 running watch -n 0.1 “setsid jetson_clocks --show”; total CPU usage was ~20%, and there were still no errors.
It is very puzzling how we can end up with such a difference!
Launching jetson_clocks --show repeatedly is just one example of generating CPU activity; any other way of stimulating the CPU works as well.

This test program was stripped down from a more efficient version where the serial port is polled using select() in non-blocking mode (roughly as in the sketch below), and of course we see the same problem with that version.
I also tried compiling and running a real-time version of the kernel on 5.0.2, but did not get any improvement.
I did the tests on the same NX to make sure the setup was identical, but I do have multiple NX units, so if more tests are advised I can easily switch JetPack versions from now on, or recompile any kernel with different settings.
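
For completeness, the select()-based polling I mention above looks roughly like this; again this is only an illustration, not the actual code from the zip:

```cpp
// Illustration only: polling the port with select() on a non-blocking
// descriptor instead of using a blocking read().
#include <fcntl.h>
#include <sys/select.h>
#include <termios.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("/dev/ttyTHS0", O_RDWR | O_NOCTTY | O_NONBLOCK);
    if (fd < 0) { perror("open"); return 1; }
    // termios setup as in the blocking version (cfmakeraw + B460800) goes here

    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        timeval tv{1, 0};                          // 1 second timeout
        int r = select(fd + 1, &rfds, nullptr, nullptr, &tv);
        if (r > 0 && FD_ISSET(fd, &rfds)) {
            unsigned char buf[256];
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n > 0) {
                // feed the bytes to the same packet parser as in the blocking version
            }
        }
    }
}
```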

Thank you very much for your help!

I’ll suggest one change: Run “sudo nvpmodel -m 8” before running “sudo jetson_clocks”. The jetson_clocks command will bind performance to the max of a given mode, but unless the mode itself already allows a max setting, you might not get what you want. So set the model first, and then max out clocks.

For reference to others reading this, in the downloaded link to SerialTest.zip, it looks like the test output from @alban.deruaz’s end is (nice organization btw):

  • R4.6.2:
    462-clocks/
    462-idle
  • R5.0.2:
    502-clocks
    502-idle

I noticed in the source it states “Opening /dev/ttyTHS0@460800”. Then it opens “/dev/ttyTHS0”. Can you verify on L4T R5.0.2 that “/dev/ttyTHS0” has group “dialout” and not group “tty”?

This looks like a really good method to reproduce the issue. NVIDIA should be able to reproduce the issue with this on a JP 5 install. I’ll try to make this work as well and comment, but I have some disk space issues on the host PC, and so it may take some time for me to work on this. Feel free though to bump this thread.

TIP: Add this forum thread’s URL to a README in the SerialTest.zip.

Meanwhile, hopefully NVIDIA will see this since it is now possible to reproduce the issue.

Hi,

Thanks for the info regarding the order of nvpmodel and jetson_clocks.

$ ls -l /dev/ttyTHS0
crw-rw---- 1 root dialout 238, 0 Apr 21 14:54 /dev/ttyTHS0

The group is dialout, and the test user is also a member of dialout, allowing non-root serial port operation.
I just updated the zip file with a README pointing to this thread’s URL.

hello alban.deruaz,

may I also know the detailed steps to reproduce the UART issue on developer kits?
could you please also share the hardware connections?

we’re setting up an environment with the 40-pin expansion header, which has pin-8/10 for UART-1_TX/RX. we’re also changing the test script to use a 115200 baud rate as its default setting.
however, when running the test script we only see “Launching acquisition”; it looks like there are no packets on the receiver side.

Hello,
On a developer kit, I just wired pin 8 and pin 10 together and ran the program.
The serial port is opened in blocking mode, meaning if it never receives anything, you won’t get any output after the ‘Launching acquisition’ print.

I guess you have replaced B460800 with B115200 at serialtest.cpp:24 and recompiled the program to change the baud rate? Perhaps dmesg reports an error associated with the serial port just after the open?
All of my tests were done with the device tree modification below:

serial@3100000 {
    nvidia,adjust-baud-rates = <0 1000000 100>;
};

I’ve been delayed by my system failing and needing a motherboard replacement, so I’m somewhat out of the loop. However, maybe NVIDIA knows of a way to dump the UART’s CTS/RTS configuration registers to see whether it really is blocking due to a setting versus something else. One would normally use an ioctl call, e.g. through something like stty, to dump settings, but each ioctl is custom to its driver, and I suspect the NVIDIA driver does not report everything this way that an actual 16550A UART driver dump would show. Tracing the problem to a feature setting versus a failure to act on a correct setting would help.
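
For what it’s worth, the part that is visible from user space can be dumped with tcgetattr() and the TIOCMGET ioctl. A rough sketch (untested on the THS driver) is below; note that this only shows the termios flags and modem-line state, not the UART’s internal registers, which is exactly the limitation I mean:

```cpp
// Rough sketch: dump the user-space-visible flow-control state of a serial
// port. This shows the termios CRTSCTS flag and the modem-line status, but
// not the UART's internal configuration registers.
#include <fcntl.h>
#include <sys/ioctl.h>
#include <termios.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("/dev/ttyTHS0", O_RDWR | O_NOCTTY | O_NONBLOCK);
    if (fd < 0) { perror("open"); return 1; }

    termios tio{};
    if (tcgetattr(fd, &tio) == 0)
        printf("CRTSCTS (hardware flow control): %s\n",
               (tio.c_cflag & CRTSCTS) ? "enabled" : "disabled");

    int lines = 0;
    if (ioctl(fd, TIOCMGET, &lines) == 0)
        printf("CTS=%d RTS=%d DTR=%d DSR=%d\n",
               !!(lines & TIOCM_CTS), !!(lines & TIOCM_RTS),
               !!(lines & TIOCM_DTR), !!(lines & TIOCM_DSR));

    close(fd);
    return 0;
}
```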

Hello!
Is there any news? Does NVIDIA have a public issue tracker somewhere, or do they fix issues internally without public visibility?

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.