Serial Kernel Failure

My team has a collection of NVIDIA Jetson Orin NX modules running JetPack 5.1.2. They use an Aetina AIB SN-41 carrier board.
We have recently started observing repeated kernel failures on some of our newer devices when sending data over the serial port “/dev/ttyTHS1”. The failures are semi-random: the exact time varies, but the failure always occurs eventually.

The failure can be reproduced with the following Python script:

import serial
import time
ser = serial.Serial("/dev/ttyTHS1", 115200, timeout=1)
ser.flush()
i = 0
while True:
  msg= "456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456"
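  i += 1
  # writable() reports whether the port object supports writing
  if ser.writable():
    ser.write(msg.encode('ascii'))
    print('writing: ' + str(i))
    # pause roughly 1/30 s between messages
    time.sleep(0.03333)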

This script uses the pyserial library, which can be installed with pip install pyserial.
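To get a feel for how heavily this loop loads the line, the time each message occupies the wire can be estimated. This is only a rough sketch: it assumes pyserial's default 8N1 framing (10 bits on the wire per byte), the function name is my own, and the 480-byte length is a hypothetical example rather than a measurement of the message above.

```python
def wire_time_s(n_bytes, baud, bits_per_byte=10):
    """Seconds needed to clock n_bytes out of the UART.

    bits_per_byte=10 assumes 8N1 framing: 1 start + 8 data + 1 stop bit.
    """
    return n_bytes * bits_per_byte / baud


SLEEP_S = 0.03333  # pause between writes in the script above

# 480 is a hypothetical message length; substitute len(msg) for the real one.
for baud in (115200, 230400, 460800):
    t = wire_time_s(480, baud)
    state = "longer than" if t > SLEEP_S else "within"
    print(f"{baud} baud: {t * 1000:.1f} ms per message ({state} the sleep interval)")
```

If the transmit time exceeds the pause, data is queued faster than it can be sent. That gives one way to compare message sizes and baud rates, but it does not by itself explain the Rx errors in the log.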

After about 2000 iterations, the syslog should start showing serial failures of the form:

[ 1340.194894] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1340.706891] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1341.218873] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1341.730481] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1341.737943] tegra-gpcdma 2600000.gpcdma: slave id already in use        	 
[ 1341.744127] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1341.751401] serial-tegra 3110000.serial: Not able to get desc for Rx    	 
[ 1342.274470] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1342.281351] tegra-gpcdma 2600000.gpcdma: slave id already in use 

This eventually results in an OS crash and a system reboot.

The crash log from the debug UART port is shown below:

nvidia@orinnx32: ~nvidia@orinnx32:~$ [ 1339.656643] serial-tegra 3110000.serid
[ 1340.194894] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1340.706891] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1341.218873] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1341.730481] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1341.737943] tegra-gpcdma 2600000.gpcdma: slave id already in use        	 
[ 1341.744127] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1341.751401] serial-tegra 3110000.serial: Not able to get desc for Rx    	 
[ 1342.274470] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1342.281351] tegra-gpcdma 2600000.gpcdma: slave id already in use        	 
[ 1342.287548] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1342.294825] serial-tegra 3110000.serial: Not able to get desc for Rx    	 
[ 1342.818457] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1342.825343] tegra-gpcdma 2600000.gpcdma: slave id already in use        	 
[ 1342.831524] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1342.838802] serial-tegra 3110000.serial: Not able to get desc for Rx    	 
[ 1343.362819] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1343.874795] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1344.390414] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1344.899117] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1345.410400] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1345.420471] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1345.954382] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1345.973754] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1346.498733] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1347.010726] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1347.522346] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1347.529222] tegra-gpcdma 2600000.gpcdma: slave id already in use        	 
[ 1347.535410] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1347.542683] serial-tegra 3110000.serial: Not able to get desc for Rx    	 
[ 1348.068075] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1348.590573] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1349.124097] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1349.640055] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1350.178269] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1350.207440] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1350.722628] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1351.240907] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1351.778612] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1352.294597] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1352.802589] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1353.314202] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1353.321092] tegra-gpcdma 2600000.gpcdma: slave id already in use        	 
[ 1353.327278] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1353.334560] serial-tegra 3110000.serial: Not able to get desc for Rx    	 
[ 1353.858185] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1353.865607] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1353.874486] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1354.407554] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1354.946168] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1354.957516] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1355.490760] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1356.002137] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1356.009033] tegra-gpcdma 2600000.gpcdma: slave id already in use        	 
[ 1356.015224] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1356.022518] serial-tegra 3110000.serial: Not able to get desc for Rx    	 
[ 1356.546505] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1357.058118] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1357.065942] tegra-gpcdma 2600000.gpcdma: slave id already in use        	 
[ 1357.072150] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1357.079415] serial-tegra 3110000.serial: Not able to get desc for Rx    	 
[ 1357.602103] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1357.608999] tegra-gpcdma 2600000.gpcdma: slave id already in use        	 
[ 1357.615188] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1357.622472] serial-tegra 3110000.serial: Not able to get desc for Rx    	 
[ 1358.146092] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1358.152974] tegra-gpcdma 2600000.gpcdma: slave id already in use        	 
[ 1358.159173] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1358.166441] serial-tegra 3110000.serial: Not able to get desc for Rx    	 
[ 1358.690478] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1359.206580] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1359.714419] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 1360.226035] serial-tegra 3110000.serial: RxData PIO to tty layer failed 	 
[ 1360.232934] tegra-gpcdma 2600000.gpcdma: slave id already in use        	 
[ 1360.239142] serial-tegra 3110000.serial: Not able to get desc for Rx    	 
[ 1360.245709] Unable to handle kernel NULL pointer dereference at virtual addr4
[ 1360.245710] Mem abort info:                                             	 
[ 1360.245712]   ESR = 0x96000004                                          	 
[ 1360.245714]   EC = 0x25: DABT (current EL), IL = 32 bits                	 
[ 1360.245715]   SET = 0, FnV = 0                                          	 
[ 1360.245715]   EA = 0, S1PTW = 0                                         	 
[ 1360.245716] Data abort info:                                            	 
[ 1360.245717]   ISV = 0, ISS = 0x00000004                                 	 
[ 1360.245717]   CM = 0, WnR = 0                                           	 
[ 1360.245720] user pgtable: 4k pages, 48-bit VAs, pgdp=000000013ba65000   	 
[ 1360.245721] [0000000000000004] pgd=0000000000000000, p4d=0000000000000000    
[ 1360.245727] Internal error: Oops: 96000004 [#1] PREEMPT SMP             	 
[ 1360.245730] Modules linked in: nvidia_modeset(O) nf_conntrack_netlink nfnetlt
[ 1360.245797]  snd_soc_tegra210_ahub nvidia(O) spi_tegra114 binfmt_misc nvmap ]
[ 1360.245811] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G       	O  	5.10.17
[ 1360.245812] Hardware name: Unknown NVIDIA Orin NX Developer Kit/NVIDIA Orin 3
[ 1360.245815] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)        	 
[ 1360.245826] pc : tegra_uart_rx_buffer_push+0x38/0x190                   	 
[ 1360.245828] lr : tegra_uart_rx_buffer_push+0x30/0x190                   	 
[ 1360.245828] sp : ffff800010003570                                       	 
[ 1360.245829] x29: ffff800010003570 x28: 0000000000000002                 	 
[ 1360.245831] x27: 0000000000000005 x26: ffff4157c366f1b0                 	 
[ 1360.245832] x25: ffffffffffffffca x24: 0000000000000001                 	 
[ 1360.245834] x23: 0000000000000005 x22: ffff4157c366f1b0                 	 
[ 1360.245835] x21: ffff41580094ec00 x20: 0000000000000fa0                 	 
[ 1360.245837] x19: ffff4157c6838c80 x18: 0000000000000010                 	 
[ 1360.245838] x17: 0000000000000000 x16: ffffb3f323225220                 	 
[ 1360.245840] x15: ffffb3f324f22bf0 x14: ffffffffffffffff                 	 
[ 1360.245841] x13: ffff800090003917 x12: ffff80001000391f                 	 
[ 1360.245843] x11: 0000000000000040 x10: ffffb3f324fa7b60                 	 
[ 1360.245845] x9 : ffffb3f324fa7b58 x8 : ffff4157c0400b90                 	 
[ 1360.245846] x7 : 0000000000000000 x6 : 0000000a01d9cfd1                 	 
[ 1360.245847] x5 : ffff4157c682e088 x4 : ffff415b2e796140                 	 
[ 1360.245849] x3 : 0000000000000000 x2 : ffffb3f3233c0d70                 	 
[ 1360.245851] x1 : 0000000000000000 x0 : ffff41580094ec00                 	 
[ 1360.245853] Call trace:                                                 	 
[ 1360.245855]  tegra_uart_rx_buffer_push+0x38/0x190                       	 
[ 1360.245857]  tegra_uart_terminate_rx_dma+0x84/0xe0                      	 
[ 1360.245859]  tegra_uart_isr+0x41c/0x4a0                                 	 
[ 1360.245865]  __handle_irq_event_percpu+0x68/0x2a0                       	 
[ 1360.245867]  handle_irq_event_percpu+0x40/0xa0                          	 
[ 1360.245869]  handle_irq_event+0x50/0xf0                                 	 
[ 1360.245871]  handle_fasteoi_irq+0xc0/0x170                              	 
[ 1360.245873]  generic_handle_irq+0x40/0x60                               	 
[ 1360.245875]  __handle_domain_irq+0x70/0xd0                              	 
[ 1360.245878]  gic_handle_irq+0x68/0x134                                  	 
[ 1360.245879]  el1_irq+0xd0/0x180                                         	 
[ 1360.245881]  console_unlock+0x36c/0x540                                 	 
[ 1360.245883]  vprintk_emit+0x124/0x2a0                                   	 
[ 1360.245887]  dev_vprintk_emit+0x154/0x184                               	 
[ 1360.245888]  dev_printk_emit+0x80/0xa8                                  	 
[ 1360.245889]  __dev_printk+0x7c/0xa4                                     	 
[ 1360.245890]  _dev_err+0x74/0x9c                                         	 
[ 1360.245892]  tegra_uart_start_rx_dma+0x128/0x140                        	 
[ 1360.245893]  tegra_uart_rx_error_handle_timer+0xe4/0xf0                 	 
[ 1360.245896]  call_timer_fn+0x3c/0x200                                   	 
[ 1360.245897]  run_timer_softirq+0x50c/0x5e0                              	 
[ 1360.245898]  __do_softirq+0x140/0x3e8                                   	 
[ 1360.245901]  irq_exit+0xc0/0xe0                                         	 
[ 1360.245903]  __handle_domain_irq+0x74/0xd0                              	 
[ 1360.245903]  gic_handle_irq+0x68/0x134                                  	 
[ 1360.245904]  el1_irq+0xd0/0x180                                         	 
[ 1360.245909]  cpuidle_enter_state+0xb8/0x410                             	 
[ 1360.245911]  cpuidle_enter+0x40/0x60                                    	 
[ 1360.245913]  call_cpuidle+0x44/0x80                                     	 
[ 1360.245914]  do_idle+0x208/0x270                                        	 
[ 1360.245915]  cpu_startup_entry+0x30/0x70                                	 
[ 1360.245918]  rest_init+0xdc/0xe8                                        	 
[ 1360.245922]  arch_call_rest_init+0x18/0x20                              	 
[ 1360.245924]  start_kernel+0x500/0x538                                   	 
[ 1360.245928] Code: aa1603e0 97ff1e09 f9413a61 aa0003f5 (b9400420)        	 
[ 1360.245937] ---[ end trace 7f5703c452e99bb1 ]---                        	 
[ 1360.250143] Kernel panic - not syncing: Oops: Fatal exception in interrupt   
[ 1360.250149] SMP: stopping secondary CPUs                                	 
[ 1360.250156] Kernel Offset: 0x33f313200000 from 0xffff800010000000       	 
[ 1360.250157] PHYS_OFFSET: 0xffffbea940000000                             	 
[ 1360.250160] CPU features: 0x08040006,4a80aa38                           	 
[ 1360.250162] Memory Limit: none                                          	 
[ 1360.713147] ---[ end Kernel panic - not syncing: Oops: Fatal exception in in-

The JetPack version is shown below:

nvidia@orinnx32:~$ sudo apt-cache show nvidia-jetpack
Package: nvidia-jetpack
Version: 5.1.2-b104
Architecture: arm64
Maintainer: NVIDIA Corporation
Installed-Size: 194
Depends: nvidia-jetpack-runtime (= 5.1.2-b104), nvidia-jetpack-dev (= 5.1.2-b104)
Homepage: http://developer.nvidia.com/jetson
Priority: standard
Section: metapackages
Filename: pool/main/n/nvidia-jetpack/nvidia-jetpack_5.1.2-b104_arm64.deb
Size: 29304
SHA256: fda2eed24747319ccd9fee9a8548c0e5dd52812363877ebe90e223b5a6e7e827
SHA1: 78c7d9e02490f96f8fbd5a091c8bef280b03ae84
MD5sum: 6be522b5542ab2af5dcf62837b34a5f0
Description: NVIDIA Jetpack Meta Package
Description-md5: ad1462289bdbc54909ae109d1d32c0a8

cat /etc/nv_tegra_release
# R35 (release), REVISION: 4.1, GCID: 33958178, BOARD: t186ref, EABI: aarch64, DATE: Tue Aug  1 19:57:35 UTC 2023

# OS Version
Ubuntu 20.04.6 LTS (GNU/Linux 5.10.120 aarch64)

We have noticed that reducing the size of the message or increasing the baud rate seems to mitigate this issue. However, the serial errors still sometimes appear in the log, which reduces our confidence that raising the baud rate is a permanent fix.
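To judge whether a mitigation really helps, it may be easier to count the driver messages than to eyeball the log. The following is a minimal sketch, assuming kernel log lines in the format shown above; the function name and the regular expression are my own and only cover the serial-tegra and tegra-gpcdma lines.

```python
import re
from collections import Counter

# Matches lines such as:
# [ 1340.194894] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
LINE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+(serial-tegra|tegra-gpcdma)\s+\S+:\s+(.*\S)"
)


def tally(log_text):
    """Count each (driver, message) pair found in a block of kernel log text."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LINE.search(line)
        if m:
            counts[(m.group(2), m.group(3))] += 1
    return counts


def report(log_text):
    """Print the tally, most frequent message first."""
    for (driver, message), n in tally(log_text).most_common():
        print(f"{n:5d}  {driver}: {message}")
```

Passing the saved output of dmesg to report() before and after a change (for example, a different baud rate) gives a like-for-like comparison of how often each error occurs.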

We are also reaching out to Aetina about this issue.

There isn’t much I can do to actually help, but I suspect it is a signal quality issue. I do have some suggestions though.

First, the quality of cable can matter, along with length and noise in the environment. What you might do is test without the cable in loopback mode. In the case of not using flow control, all you would do is jumper the TX to the RX of “/dev/ttyTHS1”. You’d have to both send and receive, and perhaps compare send to receive in the same program. If flow control were involved, then you’d have to also connect CTS and RTS.

Keep in mind that you should try this at first right at where the pins are on the carrier board, but then you could connect your serial cable (I’m assuming it isn’t USB) and do this again at the far end of the cable. Compare if direct TX/RX short works differently compared to the remote end.

If there is no issue in loopback mode, then you can move on to the cable and noise. For a cable quality issue I would expect better results at lower speeds. If loopback works locally but adding a cable increases failures, the cause could be not only signal quality but also outside noise, so you might try a shielded cable. Also note that since this is a square wave environment, it is possible that particular data patterns are an issue while others are not (or less so). The length of cable might matter with some data patterns but not with others, so perhaps also try the pattern 0x55 0xAA, or 0xAA 0x55, or pure 0x00, or pure 0xFF.
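The loopback test described above could be sketched roughly as follows. It assumes TX is jumpered to RX with no flow control. The helper names, the 480-byte payload length, the round count and the 2-second timeout are all arbitrary choices of mine. The comparison helper works on any object with write() and read(), so it can be tried without the hardware.

```python
def make_patterns(length=480):
    """Payloads suggested above: alternating bits, all zeros, all ones."""
    return {
        "0x55 0xAA": bytes([0x55, 0xAA]) * (length // 2),
        "0xAA 0x55": bytes([0xAA, 0x55]) * (length // 2),
        "0x00": bytes([0x00]) * length,
        "0xFF": bytes([0xFF]) * length,
    }


def loopback_check(port, payload):
    """Write payload, read the same number of bytes back, count byte errors."""
    port.write(payload)
    received = port.read(len(payload))
    # Bytes missing after a read timeout count as errors too.
    errors = abs(len(payload) - len(received))
    errors += sum(a != b for a, b in zip(payload, received))
    return errors


def run_on_device(device="/dev/ttyTHS1", baud=115200, rounds=100):
    """Run every pattern through the jumpered port and print error totals."""
    import serial  # pyserial; imported here so the helpers above work without it

    with serial.Serial(device, baud, timeout=2) as ser:
        for name, payload in make_patterns().items():
            ser.reset_input_buffer()  # drop stale bytes from a previous pattern
            bad = sum(loopback_check(ser, payload) for _ in range(rounds))
            print(f"{name}: {bad} byte errors in {rounds} rounds")
```

Calling run_on_device() first with the jumper at the carrier board pins, and then with the jumper at the far end of the cable, gives the comparison suggested above.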

Thanks for the suggestions.
We have been working on this with Aetina and have tracked the issue to an RS232 converter chip that we had connected via Aetina’s RS232 GPIO pins.

It’s strange to me that a faulty serial connection could crash the whole device though. I would normally expect just garbage data or no communication.

What happens depends in part on whether the error occurs in user space (a regular program) or in kernel space (the driver). No doubt the programs that actually use the UART data can fail, or can deal with corruption, such that only that program is affected. The driver, though, might not have error detection for actual hardware failure. Kernel space is not isolated the way user processes are: each user process has its own virtual address space, whereas all kernel code, drivers included, shares a single address space. A bad pointer in one driver (here a NULL pointer dereference in tegra_uart_rx_buffer_push, inside an interrupt handler) can therefore bring down the whole system.
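As an aside, the “ESR = 0x96000004” line in the crash log can be decoded by hand to confirm what kind of fault the driver hit. This is a small sketch using the standard AArch64 ESR field layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0); the function name is my own.

```python
def decode_esr(esr):
    """Split an AArch64 ESR value into the fields printed by the oops."""
    iss = esr & 0x1FFFFFF  # instruction specific syndrome, bits [24:0]
    return {
        "EC": (esr >> 26) & 0x3F,  # exception class, bits [31:26]
        "IL": (esr >> 25) & 0x1,   # 1 means a 32-bit instruction trapped
        "ISS": iss,
        "WnR": (iss >> 6) & 0x1,   # for data aborts: 0 = read, 1 = write
        "DFSC": iss & 0x3F,        # data fault status code
    }


fields = decode_esr(0x96000004)
print({name: hex(value) for name, value in fields.items()})
```

EC 0x25 is a data abort taken at the current exception level, which matches the “DABT (current EL)” line. DFSC 0x04 is a level 0 translation fault and WnR 0 means a read. Together with the faulting address 0x4 in the log, that points to a read at a small offset from a NULL pointer inside the driver rather than corrupted data being passed up to a program.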

Consider a contrived example with a “NOT” gate. One wire feeds the input of the NOT gate from somewhere, and another wire comes off its output. One would never write software to handle the case of the NOT gate failing to invert the bit. There is actually a case for detecting hardware errors with the ARM Cortex-R series for hard realtime (or DIMMs with checksums). That particular hardware has two or more CPU cores which run in parallel, and it can detect a failure and leave the working core to do the job (the other core is a “shadow” core that always has the same state as the visible core; when they differ, some method is used to decide whether the shadow core or the working core is wrong, and the system seamlessly switches to whichever core is working). Those are specialized though, and tend to be used in aircraft control systems and other avionics. Self-driving cars would have this kind of hardware. There are some non-Jetson NVIDIA systems which also have this feature. That other hardware is quite different from the ARM Cortex-A hardware of a Jetson.

Trivia: The Jetsons do have a couple of ARM Cortex-R5s, but mostly they are inaccessible to the end user. One is for the audio processing engine (APE), and the other is for the image signal processor (ISP). Both are able to run hard realtime data acquisition, and if there is a failure, then it is in the stages after the Cortex-R5s. In this case there are no shadow cores, and so you cannot get hardware failure detection (the other notable feature of what a Cortex-R can do is that of hard realtime).

More trivia: An ARM Cortex-M can run hard realtime like the Cortex-R, but it cannot use shadow cores. When there are shadow cores you can refer to it as “functional safety”. A second feature of the Cortex-R is related to the fact that scheduling hard realtime becomes much more difficult as the number of threads or processes increases. The Cortex-R can handle more of them due to hardware scheduling assist (the Cortex-M simply gets overwhelmed at some point as you increase the number of processes; the Cortex-R does too, just not as soon).

The serial driver runs entirely in the kernel, so if something goes wrong in the hardware there is no doubt that it can bring down the whole system.

It is a fascinating discovery though regarding the RS232 converter chip. Such issues are extremely difficult to find.
