Serial communication issue - "Got overrun errors"

Hi Mark,

I modified the loopback test app to use 2 stop bits.
With tx/rx at 115200 baud, the test ran for ~15 minutes (out of sync, as I see errors on the rx side).
With tx/rx at 3000000 baud, the test has been running (out of sync) for >10 minutes and is continuing.
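
For readers who want to reproduce the 2-stop-bit setting outside the test app, here is a minimal sketch of how a port could be put into raw 8-bit mode with an optional second stop bit using Python's standard termios module. This is not the code of the loopback test app (its source is not shown in this thread); the function names raw_attrs and configure_port are hypothetical.

```python
import termios


def raw_attrs(attrs, baud, two_stop_bits=False):
    """Return a copy of a tcgetattr()-style list set up for raw 8-bit I/O."""
    iflag, oflag, cflag, lflag, _ispeed, _ospeed, cc = attrs
    cc = list(cc)
    # No input translation and no software (XON/XOFF) flow control.
    iflag &= ~(termios.IGNBRK | termios.BRKINT | termios.PARMRK
               | termios.ISTRIP | termios.INLCR | termios.IGNCR
               | termios.ICRNL | termios.IXON | termios.IXOFF)
    # No output post-processing, no echo, no line editing, no signals.
    oflag &= ~termios.OPOST
    lflag &= ~(termios.ECHO | termios.ECHONL | termios.ICANON
               | termios.ISIG | termios.IEXTEN)
    # 8 data bits, no parity, one stop bit by default.
    cflag &= ~(termios.CSIZE | termios.PARENB | termios.CSTOPB)
    cflag |= termios.CS8 | termios.CREAD | termios.CLOCAL
    if two_stop_bits:
        cflag |= termios.CSTOPB  # CSTOPB selects two stop bits
    # Blocking read that returns as soon as one byte is available.
    cc[termios.VMIN] = 1
    cc[termios.VTIME] = 0
    return [iflag, oflag, cflag, lflag, baud, baud, cc]


def configure_port(fd, baud=termios.B115200, two_stop_bits=False):
    """Apply the raw settings to an already opened serial file descriptor."""
    attrs = raw_attrs(termios.tcgetattr(fd), baud, two_stop_bits)
    termios.tcsetattr(fd, termios.TCSANOW, attrs)
```

On a Linux system, configure_port would be called with a descriptor from os.open on the device node and a baud constant such as termios.B115200; whether a given high-speed constant is accepted depends on the platform and the driver.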

So I feel we are seeing somewhat different results here.

The command lines I am using:

./serialtest2S -v -r -d /dev/ttyTHS2 -b 3000000
./serialtest2S -v -t -d /dev/ttyS0 -b 3000000

Hi Galactus,

I just did a clean install using JetPack 2.2, 64-bit, with no updates and no parameter changes. With the camera removed, I ran only the following commands using the loopback test with 2 stop bits, which resulted in an instant crash of the OS:

sudo initctl stop ttyS0
./sertest -v -r -d /dev/ttyTHS2 -b 3000000

I didn’t even get the chance to start the transmit test.

What do you think might be different between our setups?

-Mark

P.S. I am now 3 for 3 repeating that sequence and crashing the OS.

Hi Mark,

My setup:

  • No debug connector at J10.
  • I left the camera module connected, as I understand that the mux diverts the UART from J17 to the camera only when the camera is active. I can remove it too and run the test again.
  • The third difference is that, after the board booted, I removed "console=tty0 console=ttyS0,115200n8 debug_uartport=lsport,0 earlyprintk=uart8250-32bit,0x70006000" from /boot/extlinux/extlinux.conf and rebooted the board. I didn’t do “sudo initctl stop ttyS0”.
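
As an illustration of the extlinux.conf change described in the last bullet, here is a small sketch that drops the console-related tokens from one kernel command line. The helper name strip_console_args and the other tokens on the sample line (root=, rw, rootwait) are hypothetical; only the four removed parameters come from this thread. Back up the file before editing it by any method.

```python
# Prefixes of the four parameters removed in the post above.
CONSOLE_PREFIXES = ("console=", "debug_uartport=", "earlyprintk=")


def strip_console_args(append_line):
    """Return the line without console-related kernel parameters.

    Leading indentation is preserved; runs of spaces between the
    remaining tokens are collapsed to a single space.
    """
    stripped = append_line.lstrip()
    indent = append_line[:len(append_line) - len(stripped)]
    kept = [tok for tok in stripped.split()
            if not tok.startswith(CONSOLE_PREFIXES)]
    return indent + " ".join(kept)


# Hypothetical sample line; a real extlinux.conf will differ.
sample = ("      APPEND root=/dev/mmcblk0p1 rw console=tty0 "
          "console=ttyS0,115200n8 debug_uartport=lsport,0 "
          "earlyprintk=uart8250-32bit,0x70006000 rootwait")
print(strip_console_args(sample))
```

Note that this removes console=tty0 as well, which matches the string quoted in the post.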

Note: I don’t see any output when I run dmesg.

Can you please try removing the console command line parameters and see if it still crashes immediately?
I will post my results after removing the camera module.

Hi Galactus,

I was up to 5 for 5 OS crashes using the steps above, then got 3 for 3 crashes just by starting the receive test, (not stopping ttyS0 after reboot).

Next, I changed the parameters as you indicated and rebooted. Here are the results:

  1. With 2 stop bits I am not getting good data through, however with 1 stop bit I get perfect tests and…

  2. I have not yet been able to crash the OS even after stopping/starting the tests many times. :) (Yay!)

Can you help me to understand what removing those parameters has accomplished?

Thank you very much for understanding this well enough to have apparently fixed it!

I would like to run more testing tomorrow with my team to verify, but this is very encouraging!

-Mark

(UPDATE: The test really hangs very badly at speeds above 1Mb/s, but it does not crash the OS… Very happy about that!)

Good to hear that removing the console parameters for ttyS0 helped.

I suspect the issue is not with the UART driver but with the console driver.
Due to the overruns, too many messages flood the console buffer. IIRC, the flush mechanism runs under a spinlock, and either it holds the lock long enough to trigger the watchdog, or it is waiting for ttyS0 to be free so that it can dump the messages.
So the UART usage and the console driver are somehow interfering with each other.

After removing ttyS0 as the console, only the test app uses ttyS0, and it works fine.

Also, if there is a proper delay on the tx side (for example, a delay of 10 ms after every 16 bytes), then it will work with flow control.
We have a FIFO depth of 32 bytes, so there should be a proper delay after 32 bytes (long enough that the rx thread can read the data from the FIFO).
If you don’t have the delay, then it will not be possible to stop this issue.
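
The pacing described above could be sketched roughly as follows. This is an illustration only, not the test app's code: the name paced_write is hypothetical, the defaults (16 bytes, 10 ms) are the example values from this post, and the write callable is assumed to accept a whole chunk at once (a raw os.write may write less and would need a retry loop).

```python
import time


def paced_write(write, data, chunk_size=16, delay_s=0.010, sleep=time.sleep):
    """Send data in fixed-size chunks with a pause between chunks.

    write -- callable that transmits one bytes object (assumed to send all of it)
    sleep -- injectable so the pacing can be checked without real delays
    Returns the number of bytes handed to write.
    """
    sent = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        write(chunk)
        sent += len(chunk)
        # Pause only if another chunk follows, giving the receiver
        # time to drain its FIFO.
        if start + chunk_size < len(data):
            sleep(delay_s)
    return sent
```

For a real port, write could be something like `lambda chunk: os.write(fd, chunk)` on a descriptor opened on the transmit device. The cost of this pacing is throughput, since the line sits idle during every pause.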

So can we say that this issue is resolved?

Thank you Galactus!

From everything I can see, those parameter changes resolve the issue. Thank you!

Do you think this is something that might be incorporated as a permanent fix in future L4T releases, or do you think we will always need to update the extlinux.conf contents on our machines?

Please let me do more thorough testing over the next day, but everything is looking great so far after many stops/starts of the test, a range of speeds up to 1Mb/s, and over 1Gbyte transferred without error on the current test run. (I will update again late tomorrow after more testing.)

Thank you again for your insight and help!

-Mark

Hi Galactus, Linuxdev, and Dusty,

I had good luck for a whole day with the patch Galactus offered, but we had an OS crash again today just by opening up a serial port and writing to and receiving from it.

This is a bad issue,… it really needs to be resolved for the sake of your product and your reputation. UART serial comms should be bulletproof to the OS… period.

I can’t spend any more of my time trying to beta test this for you; we just need it to work.

I am going to have to leave this in your capable hands for the next release and hope you work hard to keep your customers happy.

-Mark

P.S. 32 bytes plus a 10ms delay does not equal 12Mb/s data rates. Please revise your specification datasheets asap, (JetsonTX1_Module_DataSheet_DS07224010v091 Section 3.9, page 28).
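
As a rough check of the arithmetic behind this P.S., the sketch below estimates the payload rate when a 10 ms pause follows every 32-byte burst. The function name paced_throughput_bps is hypothetical, and the default of 10 bits per byte on the wire assumes 8N1 framing (start bit, 8 data bits, 1 stop bit); with 2 stop bits it would be 11.

```python
def paced_throughput_bps(chunk_bytes, delay_s, baud, bits_per_frame=10):
    """Payload bits per second when each chunk is followed by a pause."""
    wire_time_s = chunk_bytes * bits_per_frame / baud  # time spent on the line
    return chunk_bytes * 8 / (wire_time_s + delay_s)


# 32-byte bursts, 10 ms pause, line clocked at 12 Mbaud.
rate = paced_throughput_bps(32, 0.010, 12_000_000)
print(f"{rate / 1000:.1f} kbit/s")
```

Under these assumptions the result is on the order of 25 kbit/s, because the 10 ms pause dominates the roughly 27 microseconds the burst itself spends on the wire.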

Hi Mark,

It’s unfortunate that the issue was seen again.

Since the Jetson-TX1 uses serial@70006000/ttyS0 for the debug console and other UART instances for high-speed transfers, I guess the Jetson-TX1 devkit will continue to ship with the debug console options.

Please help by providing more information.

1/ From your first comment, I understand that you have a custom carrier board, and that serial@70006040/ttyTHS1 and serial@70006200/ttyTHS2 can be used for serial transfers. Can you please help us understand whether you see the same issue while using ttyTHS1 and ttyTHS2 for serial communication/test?

2/ For the loopback test between ttyS0 and ttyTHS2, can you please test the system with the changes below and check whether you still see the issue?

Along with the changes mentioned in comment #39 https://devtalk.nvidia.com/default/topic/946770/jetson-tx1/serial-communication-issue-quot-got-overrun-errors-quot-/post/4927601/#4927601

can you please also try disabling the configs below in the file arch/arm64/configs/tegra21_defconfig:

# CONFIG_VT is not set
# CONFIG_SERIAL_8250_CONSOLE is not set
# CONFIG_PSTORE is not set

Hi Galactus,

Thanks for checking in.

Going back to what I wrote in an earlier post, all of the issues we have reported have been demonstrated on the TX1 Developer Kit Carrier Board. (Please ignore our custom board so we focus here on what we can demonstrate on the TX1 Developer Kit Carrier Board.)

Please follow these instructions to crash your OS. Once you have replicated the crash, I will try your instructions above:

  1. Install a clean, standard installation using JetPack 2.2 or 2.2.1

  2. Connect ttyTHS2 receive with ttyS0 transmit in the following manner on your Developer Kit Carrier Board:

  • Header J17: pin 4 connected to J21: pin 8

  3. Run a receive instance of the serial test on ttyTHS2 at 115200 baud:
./sertest -v -r -d /dev/ttyTHS2 -b 115200

You should be able to see that you are receiving the following data from the serial console:

  • "RTNETLINK answers: Network is unreachable"

Now stop that test instance (CTRL-C).

  4. Run a receive instance of the serial test on ttyTHS2 at the WRONG baud rate to receive what the serial console is still sending:
./sertest -v -r -d /dev/ttyTHS2 -b 1000000

Your OS will crash and automatically reboot after 30 seconds.

A mis-matched baud rate on the receive side shouldn’t crash the OS under any circumstance.

Please let me know when you have observed the same. Thanks!

-Mark

Hi Galactus, Dusty_NV, and Linuxdev,

Have any of you had luck duplicating an OS crash by following my instructions above? Any new insights or developments?

Thanks for your help with this issue!

-Mark

I haven’t had a chance to test it yet, likely tomorrow.

From this post:
https://devtalk.nvidia.com/default/topic/946770/jetson-tx1/serial-communication-issue-quot-got-overrun-errors-quot-/post/4934436/#4934436
…before connecting ttyTHS2 receive with ttyS0 transmit, what steps were used to disable serial console?

Hi Linuxdev,

To answer your question… none, just follow my instructions precisely.

Point being, you’re going to need something sending data into ttyTHS2 to crash the OS… If you disable ttyS0 you’re going to have to use some other source of data (provided by yourself) to be received by ttyTHS2. If you prefer to hook up some other source, go for it… I was just using ttyS0 as a convenient source of consistent data for ttyTHS2 to listen to.

For instance, you can use the loopback test and a USB to serial converter to eliminate the dependency on ttyS0 and all potential interactions with ttyS0. I am successful in crashing the OS with a hardware configuration in that manner as well. I only suggest you use ttyS0 as a source of data so that your test setup only requires a single jumper wire to accomplish the test.

Let me know if you’d like a suggestion for USB to serial converters we’ve used. The FTDI chip converters seem to play the best with the TX1 drivers; Prolific converter drivers on the TX1 frequently crash (the driver, not the OS).

Let me know if the exercise is making sense. (The purpose is to receive data at ttyTHS2 and see that it is enough to crash the OS.)

-Mark

It sounds like you were able to keep a serial console connected while bridging J21 pin-8 to J17 pin-4. I bridged this and could verify ttyTHS2 was receiving from ttyS0 based on some text normally sent to serial console. After that, I had no way to cause the serial console to generate more data since serial port was removed in order to add the jumper. When using ttyS0 as a data source, how were you causing more data to be generated?

As long as I don’t use it for serial console I do have an FTDI USB serial UART which works at 3.3V levels (this is what I normally use on the J21 serial console). I could just as easily plug this into the J17 connector and generate data through the USB pseudo terminal (such as via terminal program or the serial test program). Would this be preferable?

Hi Linuxdev,

You don’t need to do anything to ttyS0, it will keep sending the same data as long as you don’t interfere with it. All you have to do is restart the receive program at a different baud rate. (I feel like we are speaking past each other, not to each other… Please help me understand if this doesn’t make sense.)

Please read my instructions carefully:

Start a receive to verify that it’s receiving data, then restart it at 1000000b/s and watch your OS crash. The problem doesn’t appear to have much to do with the transmit instance, it appears to be with the receive instance.

Please concentrate on the fact that ttyTHS2 must be receiving some form of data, and you must ask it for an incorrect baud rate, and then at some point your OS will crash. Supply it with some data and you’ll see the whole point to this entire post. It just so happens that if you supply it with data from ttyS0 it happens the fastest and most reliably.

If you have an FTDI USB serial converter, please use it to supply data to ttyTHS2. Here are instructions; as long as you execute them LINE FOR LINE, you should see the same things I see:

  1. Install a clean, standard installation using JetPack 2.2.1

  2. Connect something, (anything), that can send serial data to ttyTHS2:

  3. Get ttyTHS2 to receive your data however you wish… we provided you with a good test, but if you prefer to use your own, please supply it for us to use as well. If you use ours, you can run it as such:

    ./sertest -v -r -d /dev/ttyTHS2 -b 1000000

You should be able to see that if you are receiving some kind of data, your system will now become unstable and likely crash at some point… THIS IS THE WHOLE PROBLEM… RECEIVING UART DATA SHOULD NOT MAKE THE SYSTEM UNSTABLE!!!

Please let me know if this really, honestly, isn’t clear for any reason.

-Mark

P.S. Please, if this doesn’t make sense, ask Dusty_NV for my personal phone number which he has, and call me at any hour of the day or night so we can talk. I really want to be done with this problem, in the worst way. :)

P.P.S. here’s my website, please see real evidence that I’ve gotten other things to work in the past and I am not a total idiot. https://greypointcorp.com/

Hi Mark,

I am trying to repro the issue as you have suggested and actively working on it.

  • long duration transfers.
  • small duration transfer and reboot in loop.
  • small duration transfer with different baud rates.

I am able to repro the hang and the error messages (without removing ttyS0 as the debug console).

I will update you soon on this.

Galactus,

Thanks!, really, thanks! :)

Just FYI, I can lock the system with serial port, but I did not get a reboot. I ended up using the magic-sysrq keys to try to generate a kernel OOPS, but unfortunately this never gets logged in that state. I think the OOPS simply gets sent to serial console…which is not connected. This is on R24.1 64-bit, I’ll do some more testing, but without serial console and without a JTAG debugger and no way to record the OOPS, I’m not sure if it’ll help much.

Hi Linuxdev, Galactus, and Dusty_NV,

I’ve gotten a chance to speak directly with the NVIDIA developer team directly responsible for this driver and they have identified the issue, replicated it, created a few patches for us to try out, and this is such great news I had to post it!!!

Thanks to everyone involved who has worked on this, it is a really big deal in the universe… (at least to me. :) )

-Mark

I haven’t seen the patches hit http://nv-tegra.nvidia.com/gitweb/?p=linux-3.10.git, any chance you have more info about when we’ll be able to test these fixes?