Serial communication issue - "Got overrun errors"

Was the testing done only on the serial ports of the PCIe serial card, and for the built-in serial port testing, which ports were used? I just want to verify J17 was not involved, since it is linked to the camera module.

From the 16550 serial port card failure, is there any kind of log or OOPS message you can post? It would help to see how much of this error overlaps with the onboard serial port error.

EDIT: I just went back through some of the thread, and have another observation. This is a custom board, so I assume there is no camera module attached. However, there would still be firmware or other setup linking what was J17 to the camera module unless you’ve modified this. Are you sure there is no firmware or other content which might interfere with using the serial UARTs for custom purposes?

If you want to browse a dtb file from “/boot”, dtc is built in the kernel source tree at “scripts/dtc/dtc”, or it can be installed on the host. The following decompiles a dtb file into a human-readable dts (though any original source code comments or naming will not be present):

dtc -I dtb -O dts -o /tmp/extracted.dts /boot/the_firmware_in_extlinux.dtb

I installed the PCIe serial card and I couldn’t get it to work. The lspci command didn’t show anything on the PCIe bus, so I have not been able to test with the PCIe serial ports yet. I think it’s the same problem mentioned here:

https://devtalk.nvidia.com/default/topic/936285/jetson-tx1/tx1-serial-port-configuration/

It’s true we have a custom carrier board for the TX1, but as soon as we ran into problems, we switched back to the NVIDIA development board to eliminate any hardware issues. Our testing is currently with the dev boards only. We have removed the camera module, and the problem has not gone away. I walked through my kernel config and tried to remove any camera-related drivers, but now my kernel won’t build, so I have to revisit that myself and figure out what I can and cannot remove.

Thanks for the instructions to reverse-compile the dtb file, I may use that in the future. However, our dtb file is currently stock from the Jetpack 2.2 install.

-Dennis

The stock dtb will likely have some config for the camera, so this is why I mention it.

In terms of kernel compile, I’ve found I can build modules with other Linaro compilers, but if building the image itself, I had to use the v4.8 compilers that come with the driver package documentation (it’s in the “baggage” subdirectory).

I don’t have the particular PCIe serial port card, so I can’t test, but you might want to post the model and such so there is an online reference…someone else may be able to obtain one for testing.

Hi Linuxdev,

Just to clarify from your #21 post, we have accomplished all of these failures on the stock eval board, e.g. the one you can buy on amazon, the one you sell, so that you could also see and verify the same exact failures we are seeing…https://www.amazon.com/gp/product/B017NWO6LG/ref=oh_aui_detailpage_o00_s00?ie=UTF8&psc=1 everything stock, nothing custom.

To be absolutely clear, we have accomplished every one of the tests we have declared as failing on the TX1 development board as sold by NVIDIA, stock, with the images you provide installed by the software you provide… this is completely without modification, reproducing all of these failures with only your hardware and your software… there is nothing magic, just read what I’ve written and try it for yourself, please.

-Mark

P.S. We’ve also used over 5 different hardware units in all of this testing, so it’s not related to a specific board, it’s reproducible across multiple stock units, 32 bit and 64 bit, and it really seems to be very low level driver related. i.e., we’d really like it if someone from NVIDIA could start looking into the driver with us on this, please.

P.P.S. I’ve added a profile picture to help me plead my case that I paid real money for your product and would really like support for the “as purchased” product to perform “as advertised”.

One last thing I’m going to do tonight, which is mostly for a laugh, although not really laughable, but still amusing… and since I’m on the subject of profile pictures, Dennis was the one who had to point out to me the cleverness of your profile (Jet + Sun + Linux Tux + Muscles = Mighty Jetson Linux Developer), so… please show us your stuff and help us fix this no good, really bad, horrible, (never ending) day. https://en.wikipedia.org/wiki/Alexander_and_the_Terrible,_Horrible,_No_Good,_Very_Bad_Day :)

Thank you again for all of your help… If we knew the answer we would post it, this just is a really stinky problem to deal with on every front. :} We appreciate your support.

-Mark

For reference, here is the PCIe serial port card I am having zero success with:

SIIG Cyber 2S1P PCIe
http://www.siig.com/it-products/serial-parallel/combo/pcie/cyber-2s1p-pcie.html

Has anyone used a different card and been successful? I’d be willing to buy a different one if I knew it worked.

-Dennis

I’m trying to figure out a way to test something which I can debug without a JTAG debugger. I think I have a way, but it may take a day to actually see if it helps.

There was another post here which may be relevant to the PCIe serial port card not showing up…I had forgotten all about spread spectrum options on PCIe, but it makes sense that this is the issue for cards which can’t even be seen. Normally signal quality issues only apply to data lanes, and the control signals used for enumeration should always work with their slow speeds…when the control signal does not work, something is really wrong, but I’m fairly confident under the circumstances that there must be a software issue…a PCIe card not responding to spread spectrum would probably just not even show up. Here’s the post which reminded me of this (I’ll have to dig further on this; I don’t know the details yet):
https://devtalk.nvidia.com/default/topic/936883/jetson-tx1/nothing-showing-up-with-lspci-v/post/4923297/#4923297

PS: It’s amusing someone actually noticed my profile pic :P I also needed something unique so I could quickly sort my posts from others without reading text to slow me down.

Hi Guys,

How many stop bits are you setting?
The Tegra TX1 UART receiver has low baud rate tolerance in 1-stop-bit mode. It may lose sync with the external transmitter, resulting in data errors or corruption.
In this case, we must use 2 stop bits to fix the issue, as the extra stop bit allows the TX1 receiver logic to align properly with the external transmitter.

Hi nVConan,

Check out the loopback test we posted previously (about line 200), https://devtalk.nvidia.com/default/topic/946770/jetson-tx1/serial-communication-issue-quot-got-overrun-errors-quot-/post/4922031/#4922031… only 1 stop bit. You can copy/paste what we posted, compile it and run it yourself by simply connecting the J17 and J21 pins as I described in the post previously. Connect J17 pin 4 to J21 pin 8, and J17 pin 5 to J21 pin 10… (btw, I wish I didn’t know those pins by memory… :) ) and try it out for yourself as well.

We’ll see if we can get things working by changing our code and loopback test to 2 stop bits. Let us know if you can get the loopback test to work without errors with 2 stop bit as well if you get the chance. Thanks for the suggestion!

Also, late today Dennis was able to get the 32-bit 24.1 L4T to work by removing the camera module. I was able to crash the 64bit previously with the camera module removed (multiple times/repeatably), but I didn’t try the 32 bit with camera removed, (thanks for trying that Dennis). Tomorrow and Monday we are going to try to understand the differences better and retest both installations very carefully. I only mention this in case it helps you track down the root cause. We’ll report back on any findings as we make progress as well.

Thanks for your help.

-Mark

BTW, I realize I didn’t give step by step instructions for reproducing what we see using the loopback test.

  1. Copy our loopback test posted here and save it in a text file (sertest.c): https://devtalk.nvidia.com/default/topic/946770/jetson-tx1/serial-communication-issue-quot-got-overrun-errors-quot-/post/4922031/#4922031

  2. Compile the copied code using gcc:

gcc sertest.c -o sertest

  3. Connect J17 pin 4 to J21 pin 8, and J17 pin 5 to J21 pin 10.

  4. Run one instance of the executable with the “-t” argument to transmit:

./sertest -v -t -d /dev/ttyTHS1 -b 115200

and another instance with the “-r” argument to receive:

./sertest -v -r -d /dev/ttyTHS2 -b 115200

  5. Enjoy the ensuing mayhem of transmitted and erroneously received data followed after an indeterminate time by the crash of the TX1 operating system! :)

Also, Dennis created the executable with help prompts as well, ./sertest -h
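For anyone who wants to reason about the receive-side errors without reading the whole test program, here is a minimal sketch of the general idea: compare each received byte against the next value of a known repeating sequence. This is not the posted sertest.c. The 'A'..'Z' pattern and the names next_expected and count_mismatches are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed test pattern, for illustration only: 'A'..'Z' repeating. */
static char next_expected(char c)
{
    return (c >= 'A' && c < 'Z') ? (char)(c + 1) : 'A';
}

/* Hypothetical receiver-side check: counts bytes that differ from the
 * expected sequence. *expected carries the sequence position across calls.
 * The checker never resynchronises to the received data, so one lost byte
 * makes every following byte count as an error. */
static size_t count_mismatches(const unsigned char *buf, size_t len,
                               char *expected)
{
    size_t errors = 0;

    assert(buf != NULL && expected != NULL);

    for (size_t i = 0; i < len; i++) {
        if ((char)buf[i] != *expected)
            errors++;
        *expected = next_expected(*expected);
    }
    return errors;
}
```

With a checker like this, a single dropped byte shows up as a constant offset between the received and expected values rather than as one isolated error, so a long run of errors does not by itself mean every byte was corrupted.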

I’ll also give it a whirl with 2 stop bits asap when I am back at the hardware. Thanks for your help!

-Mark

Hi nVConan and Linuxdev,

I was able to run the loopback test on a 64 bit machine with 2 stopbits enabled and the camera module installed and also with the camera module removed. The result with 2 stopbits enabled is that instead of running for an indeterminate time, it crashes nearly instantaneously with 100% repeatability. (Please feel free to try it out for yourself! at least having instant results is satisfying, if not the results hoped for.)

Here is a new datapoint for you, though. With a 64 bit installation, the camera module removed, and the test using only 1 stop bit, the same result as previously discussed in this post remains. However, with a 32 bit L4T 24.1 installation and the camera module removed, Dennis was able to get a fully working loopback test with over 1,000,000 bytes transferred and no errors… Perhaps this is a lead you can use to investigate with.

I am going to run tests with L4T 23.1 installations now and will report back on those results when I’ve gotten them done.

-Mark

Hi Gang,

I have an update. It took me half the weekend fighting with a VirtualBox Ubuntu 14.04 virtual machine, but I was finally able to get through a successful installation of L4T 23.2 32-bit and was able to instantaneously crash the OS with the loopback test.

(Side note, Jetpack 2.2 /L4T 24.1 is a significant user experience improvement, so please pass along my compliments.)

It was an inordinate amount of fighting with IPv4 vs IPv6 network and other settings on the virtual machine to report this data back, but I am glad to be able to state that the issue absolutely and unequivocally exists within R23.2 32-bit as well.

-Mark

P.S. The faster you can get JetPack for Ubuntu 16.04, the better… please help us all by making this a priority! :)

Hi Mcsauder

dmesg | grep THS
[ 2.690754] 70006040.serial: ttyTHS1 at MMIO 0x70006040 (irq = 69) is a SERIAL_TEGRA
[ 2.700602] 70006200.serial: ttyTHS2 at MMIO 0x70006200 (irq = 78) is a SERIAL_TEGRA
[ 2.714015] 70006300.serial: ttyTHS3 at MMIO 0x70006300 (irq = 122) is a SERIAL_TEGRA

There are four UARTs in Tegra, and they can be mapped as below (P2795 schematic names used):
serial@70006000 – ttyS0 (console UART)
serial@70006040 – THS1 – UART2
serial@70006200 – THS2 – UART3
serial@70006300 – THS3 – UART4

serial@70006040/THS1 goes to the M.2 Key-E connector.
serial@70006000 is muxed between the debug 60-pin connector and J21, and it is ttyS0 (console UART). Isn’t this the UART you are using?
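If it helps to check this mapping on another board, here is a rough sketch that pulls the tty name, MMIO address and IRQ out of a log line shaped like the dmesg output quoted above. The struct uart_entry type and the parse_uart_line function are hypothetical names, and the format string assumes the line layout is exactly as shown in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one UART reported in the kernel log. */
struct uart_entry {
    char tty[16];        /* e.g. "ttyTHS1" */
    unsigned long mmio;  /* e.g. 0x70006040 */
    int irq;             /* e.g. 69 */
};

/* Parse a line shaped like:
 * "[ 2.690754] 70006040.serial: ttyTHS1 at MMIO 0x70006040 (irq = 69) is a SERIAL_TEGRA"
 * Returns 1 on success, 0 if the line does not match that shape. */
static int parse_uart_line(const char *line, struct uart_entry *out)
{
    const char *p;

    assert(line != NULL && out != NULL);

    p = strstr(line, ".serial: ");
    if (p == NULL)
        return 0;
    p += strlen(".serial: ");

    return sscanf(p, "%15s at MMIO 0x%lx (irq = %d)",
                  out->tty, &out->mmio, &out->irq) == 3;
}
```

Feeding it each line of "dmesg | grep THS" gives the address-to-tty pairs that can be compared against the table above.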

I tried to run loopback app provided by you in two ssh terminals.
./sertest -v -r -d /dev/ttyTHS2 -b 115200
or
./sertest -v -r -d /dev/ttyTHS1 -b 115200

waits forever for data.

So I used

./serialtest -v -r -d /dev/ttyTHS2 -b 115200
./serialtest -v -t -d /dev/ttyS0 -b 115200

and see messages as below:

rx side


Error - unexpected value: F, should be: U
Error - unexpected value: G, should be: V
rx: 3163000
Error - unexpected value: H, should be: W
Error - unexpected value: I, should be: X
Error - unexpected value: J, should be: Y

tx side


tx: 3292000
tx: 3293000
tx: 3294000
tx: 3295000
tx: 3296000
tx: 3297000
tx: 3298000

Is this the expected output, or is my setup still wrong?

If I kill the rx side and later the tx side, I do see a flood of messages and, most probably, a watchdog timeout:
[ 2851.109865] serial-tegra 70006200.serial: Got overrun errors

If I kill tx and then rx, then I don’t see any issue.

I say watchdog because after the board reboots I see:
[ 8.994319] last reset is due to tegra watchdog timeout

This is currently under debug (why the buffers were flooded, and why existing buffers were not overwritten in the absence of HW flow control).
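One way to watch the overrun count from user space, instead of waiting for the log to flood, might be the Linux TIOCGICOUNT ioctl. The sketch below assumes the Tegra serial driver reports its counters through that ioctl, which I have not verified on this kernel. read_overruns is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/serial.h>

/* Hypothetical helper: fetch the cumulative receiver overrun count for an
 * already-open serial file descriptor.
 * Returns 0 on success, or -1 if the ioctl fails (bad fd, or a driver that
 * does not implement TIOCGICOUNT). *overruns is left untouched on failure. */
static int read_overruns(int fd, int *overruns)
{
    struct serial_icounter_struct ic;

    assert(overruns != NULL);

    if (ioctl(fd, TIOCGICOUNT, &ic) < 0)
        return -1;

    *overruns = ic.overrun;
    return 0;
}
```

Calling this once a second on the receive-side descriptor and printing the difference would show whether overruns build up gradually or only appear when one side is killed.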

Does a loopback test between different UART ports also give the same result?

PS: I have requested for change in my screen name :)

Hi Galactus,

Thanks for trying out the loopback test. Your second attempt, with one “r” receive instance and one “t” transmit instance is correct. Also, before you start the loopback test, you will want to stop the serial console to alleviate collisions between ttyS0 and ttyTHS1 which use the same hardware:

sudo initctl stop ttyS0

Once you have stopped the serial console, start the receive function, then start the transmit function, then let it run. It may take some time, and they will likely fall out of sync, but will keep transmitting and receiving data, and if you let it run long enough we think you’ll see a crash of the OS followed by a reboot.

It is the OS crash that we are highly concerned with and would like additional attention paid to.

Thanks for testing that with us.

-Mark

Hi Mark,

Yes, just after posting here I tried removing all debug/console entries from the kernel command line parameters (in /boot/extlinux/extlinux.conf) and running the tx and rx loopback test, and now I see no errors on the rx side.

rx: 210000
rx: 211000
rx: 212000
rx: 213000

tx: 180000
tx: 181000
tx: 182000
tx: 183000

I think we are on the same page now.

The test is ongoing and I am waiting for the kernel crash.

… stop the serial console to alleviate collisions between ttyS0 and ttyTHS1 which use the same hardware
Sorry, I couldn’t get this

I understand ttyS0 and THS1 are two different instances of UART HW
serial@70006000 – ttyS0
serial@70006040 – THS1

THS1 goes to M.2 connector.

Starting rx and then starting tx, it was running fine (no error prints in rx side) for quite some time.

Since you suggested waiting because “they will likely fall out of sync”, I deliberately put them out of sync by stopping the rx side of the loopback test app and starting it again. Now lots of errors are seen on the rx side:

, should be: Ected value:
Error - unexpected value: , should be: F
Error - unexpected value: , should be: G
Error - unexpected value: , should be: H
Error - unexpected value: , should be: I
Error - unexpected value: !, should be: J

With this “out of sync” state, both rx and tx are still running, and I am waiting for the kernel to hang and the watchdog to time out.

Both rx and tx have been running (though out of sync) for the past 25 minutes. I am letting it run longer.

Just wanted to confirm with you: how long does it usually take before you see the kernel hang and reboot?

Hi Mark,

Can you please also try removing these parameters from the kernel command line (in /boot/extlinux/extlinux.conf) and see if there is any change in behaviour:

console=tty0 console=ttyS0,115200n8 debug_uartport=lsport,0 earlyprintk=uart8250-32bit,0x70006000
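To make it unambiguous which tokens are meant, here is a small sketch that drops them from a command-line string. strip_console_args is a hypothetical helper, and matching on the three prefixes console=, debug_uartport= and earlyprintk= is my assumption about how to select the four parameters listed above. It only filters a string and does not touch extlinux.conf itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `cmdline` into `out`, dropping every
 * space-separated token that starts with one of the listed prefixes.
 * Output is truncated at a token boundary if `out` is too small. */
static void strip_console_args(const char *cmdline, char *out, size_t out_size)
{
    static const char *const drop[] = {
        "console=", "debug_uartport=", "earlyprintk="
    };
    const char *p = cmdline;
    size_t used = 0;

    assert(cmdline != NULL && out != NULL && out_size > 0);
    out[0] = '\0';

    while (*p != '\0') {
        size_t len;
        int skip = 0;

        p += strspn(p, " ");      /* skip separating spaces */
        len = strcspn(p, " ");    /* length of the next token */
        if (len == 0)
            break;

        for (size_t i = 0; i < sizeof drop / sizeof drop[0]; i++) {
            if (strncmp(p, drop[i], strlen(drop[i])) == 0) {
                skip = 1;
                break;
            }
        }

        if (!skip) {
            size_t need = len + (used > 0 ? 1 : 0);
            if (used + need + 1 > out_size)
                break;            /* stop rather than overflow */
            if (used > 0)
                out[used++] = ' ';
            memcpy(out + used, p, len);
            used += len;
            out[used] = '\0';
        }
        p += len;
    }
}
```

Any other tokens on the line pass through unchanged, so the result can be compared against the edited APPEND line before rebooting.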

Thanks for trying this all out Galactus,

We had the same idea as you about ttyS0 and ttyTHS1, but it appears they are one and the same somewhere at a hardware level… this has been confusing to us as well. Once we eliminated the serial data collisions on THS1 and THS2, (THS2 by removing the camera from the development board), it takes a lot longer to crash the OS. If you want, you can try to accelerate the crash by changing a few things:

  1. Speed up the baud rate, (e.g. ./sertest -v -t -d /dev/ttyTHS1 -b 3000000), or
  2. Use 8n2 (2 stop bits) by changing line 201 of the loopback test:
// One stop bit
//    deviceOptions.c_cflag &= ~CSTOPB;

// Two stop bits (OR the flag in; "&= CSTOPB" would clear every other c_cflag bit)
    deviceOptions.c_cflag |= CSTOPB;

For me, using 2 stop bits makes things crash almost instantly, otherwise, you might be stuck waiting a while with one stop bit even at high speed data rates.

I just re-verified on my system with the camera module removed and 2 stop bits, it crashed the OS within a second.
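For reference, here is a minimal sketch of how the frame format is normally selected through termios. set_frame_8n is a hypothetical helper and not part of the posted sertest.c. The thing to watch is that CSTOPB is set with a bitwise OR, since an AND with CSTOPB would wipe out every other bit in c_cflag (character size, CREAD and so on).

```c
#include <assert.h>
#include <stddef.h>
#include <termios.h>

/* Hypothetical helper: select 8 data bits, no parity, and either one or two
 * stop bits on an existing termios struct, leaving all other flags alone.
 * The caller still has to apply the settings with tcsetattr(). */
static void set_frame_8n(struct termios *opts, int two_stop_bits)
{
    assert(opts != NULL);

    opts->c_cflag &= ~(tcflag_t)CSIZE;    /* clear the character-size field */
    opts->c_cflag |= CS8;                 /* 8 data bits */
    opts->c_cflag &= ~(tcflag_t)PARENB;   /* no parity */

    if (two_stop_bits)
        opts->c_cflag |= CSTOPB;          /* OR sets the bit, keeps the rest */
    else
        opts->c_cflag &= ~(tcflag_t)CSTOPB;
}
```

Typical use would be tcgetattr() on the open port, set_frame_8n(&opts, 1), then tcsetattr(fd, TCSANOW, &opts).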

Let me know how it goes for you! Thanks again for trying this out!

-Mark

P.S. I will try removing the parameters and report back on my results.