Jetson Nano crashes after 3 to 10 days of operation

• Hardware Platform (Jetson / GPU) Jetson Nano
• DeepStream Version 5.0
• JetPack Version (valid for Jetson only) 4.4-b144
• TensorRT Version 7.1.3
• Issue Type( questions, new requirements, bugs) question

Dear All,

I have been experiencing crashes on my Jetson Nano for a couple of years now.

After a random number of days with very little stress on the board (not even using the GPU), the system reboots itself and then gets stuck.
Apparently, nothing gets logged after the reboot procedure; the syslog contains only the entries from the kernel log.

This is an extract of kernel log from a normal boot:

Oct 15 16:15:27 jnano-desktop kernel: [   25.007166] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Oct 15 16:15:29 jnano-desktop kernel: [   27.375455] zram: Added device: zram0
Oct 15 16:15:29 jnano-desktop kernel: [   27.375862] zram: Added device: zram1
Oct 15 16:15:29 jnano-desktop kernel: [   27.376234] zram: Added device: zram2
Oct 15 16:15:29 jnano-desktop kernel: [   27.376598] zram: Added device: zram3
Oct 15 16:15:29 jnano-desktop kernel: [   27.397458] zram0: detected capacity change from 0 to 519585792
Oct 15 16:15:29 jnano-desktop kernel: [   27.521702] Adding 507404k swap on /dev/zram0.  Priority:5 extents:1 across:507404k SS
Oct 15 16:15:29 jnano-desktop kernel: [   27.524794] zram1: detected capacity change from 0 to 519585792
Oct 15 16:15:29 jnano-desktop kernel: [   27.535941] Adding 507404k swap on /dev/zram1.  Priority:5 extents:1 across:507404k SS
Oct 15 16:15:29 jnano-desktop kernel: [   27.539100] zram2: detected capacity change from 0 to 519585792
Oct 15 16:15:29 jnano-desktop kernel: [   27.549878] Adding 507404k swap on /dev/zram2.  Priority:5 extents:1 across:507404k SS
Oct 15 16:15:29 jnano-desktop kernel: [   27.552926] zram3: detected capacity change from 0 to 519585792
Oct 15 16:15:29 jnano-desktop kernel: [   27.568009] Adding 507404k swap on /dev/zram3.  Priority:5 extents:1 across:507404k SS
Oct 15 16:15:31 jnano-desktop kernel: [   29.082315] r8168: eth0: link up
Oct 15 16:15:31 jnano-desktop kernel: [   29.082359] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Oct 15 16:15:46 jnano-desktop kernel: [   43.762370] tegradc tegradc.0: blank - powerdown
Oct 15 16:15:46 jnano-desktop kernel: [   43.762380] tegradc tegradc.1: blank - powerdown
Oct 15 16:15:53 jnano-desktop kernel: [   51.233314] Bridge firewalling registered
Oct 15 16:15:53 jnano-desktop kernel: [   51.612068] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Oct 15 16:15:54 jnano-desktop kernel: [   52.541180] Netfilter messages via NETLINK v0.30.
Oct 15 16:15:54 jnano-desktop kernel: [   52.549235] ctnetlink v0.93: registering with nfnetlink.
Oct 15 16:15:55 jnano-desktop kernel: [   52.937662] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
Oct 15 16:15:59 jnano-desktop kernel: [   57.638081] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
Oct 15 16:15:59 jnano-desktop kernel: [   57.638092] Bluetooth: BNEP socket layer initialized
Oct 15 16:16:00 jnano-desktop kernel: [   58.327758] fuse init (API version 7.26)
Oct 15 16:16:08 jnano-desktop kernel: [   66.717560] EXT4-fs (mmcblk0p1): warning: mounting fs with errors, running e2fsck is recommended
Oct 15 16:16:08 jnano-desktop kernel: [   66.729308] EXT4-fs (mmcblk0p1): recovery complete
Oct 15 16:16:08 jnano-desktop kernel: [   66.731465] EXT4-fs (mmcblk0p1): mounted filesystem with ordered data mode. Opts: (null)
Oct 15 16:16:33 jnano-desktop kernel: [   91.154819] nvgpu: 57000000.gpu             railgate_enable_store:297  [INFO]  railgate is disabled.

Instead, this is the last part of the kernel log from the faulty boot procedure:

Oct 14 22:51:27 jnano-desktop kernel: [   24.071307] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Oct 14 22:51:29 jnano-desktop kernel: [   26.085068] zram: Added device: zram0
Oct 14 22:51:29 jnano-desktop kernel: [   26.085674] zram: Added device: zram1
Oct 14 22:51:29 jnano-desktop kernel: [   26.088345] zram: Added device: zram2
Oct 14 22:51:29 jnano-desktop kernel: [   26.089650] zram: Added device: zram3
Oct 14 22:51:29 jnano-desktop kernel: [   26.119664] zram0: detected capacity change from 0 to 518537216
Oct 14 22:51:29 jnano-desktop kernel: [   26.169756] Adding 506380k swap on /dev/zram0.  Priority:5 extents:1 across:506380k SS
Oct 14 22:51:29 jnano-desktop kernel: [   26.173432] zram1: detected capacity change from 0 to 518537216
Oct 14 22:51:29 jnano-desktop kernel: [   26.195861] Adding 506380k swap on /dev/zram1.  Priority:5 extents:1 across:506380k SS
Oct 14 22:51:29 jnano-desktop kernel: [   26.216176] zram2: detected capacity change from 0 to 518537216
Oct 14 22:51:29 jnano-desktop kernel: [   26.251188] Adding 506380k swap on /dev/zram2.  Priority:5 extents:1 across:506380k SS
Oct 14 22:51:29 jnano-desktop kernel: [   26.260994] zram3: detected capacity change from 0 to 518537216
Oct 14 22:51:29 jnano-desktop kernel: [   26.275883] Adding 506380k swap on /dev/zram3.  Priority:5 extents:1 across:506380k SS
Oct 14 22:51:31 jnano-desktop kernel: [   28.154587] r8168: eth0: link up
Oct 14 22:51:31 jnano-desktop kernel: [   28.155008] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Oct 14 22:51:47 jnano-desktop kernel: [   43.568707] tegradc tegradc.0: blank - powerdown
Oct 14 22:51:47 jnano-desktop kernel: [   43.631766] extcon-disp-state extcon:disp-state: cable 47 state 0
Oct 14 22:51:47 jnano-desktop kernel: [   43.631769] Extcon AUX1(HDMI) disable
Oct 14 22:51:47 jnano-desktop kernel: [   43.652225] tegradc tegradc.0: unblank
Oct 14 22:51:47 jnano-desktop kernel: [   43.665053] tegradc tegradc.0: nominal-pclk:148500000 parent:148500000 div:1.0 pclk:148500000 147015000~161865000
Oct 14 22:51:47 jnano-desktop kernel: [   43.665138] tegradc tegradc.0: hdmi: tmds rate:148500K prod-setting:prod_c_hdmi_75m_150m
Oct 14 22:51:47 jnano-desktop kernel: [   43.666822] tegradc tegradc.0: hdmi: get RGB quant from EDID.
Oct 14 22:51:47 jnano-desktop kernel: [   43.666829] tegradc tegradc.0: hdmi: get YCC quant from EDID.
Oct 14 22:51:47 jnano-desktop kernel: [   43.701606] extcon-disp-state extcon:disp-state: cable 47 state 1
Oct 14 22:51:47 jnano-desktop kernel: [   43.701609] Extcon AUX1(HDMI) enable
Oct 14 22:51:47 jnano-desktop kernel: [   43.707836] tegradc tegradc.0: unblank
Oct 14 22:51:47 jnano-desktop kernel: [   43.708249] tegradc tegradc.1: blank - powerdown
Oct 14 22:51:51 jnano-desktop kernel: [   48.352863] Bridge firewalling registered
Oct 14 22:51:51 jnano-desktop kernel: [   48.390972] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Oct 14 22:51:52 jnano-desktop kernel: [   48.997819] Netfilter messages via NETLINK v0.30.
Oct 14 22:51:52 jnano-desktop kernel: [   49.001449] ctnetlink v0.93: registering with nfnetlink.
Oct 14 22:51:52 jnano-desktop kernel: [   49.377000] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready

The difference I notice is that the faulty boot says nothing about mounting the filesystem.

Do you have any idea why the Jetson Nano behaves like that?

Thanks in advance!

The mounting errors are actually from not shutting down properly prior to that boot (the crash did not allow a proper shutdown). Those particular messages are about recovering by “unwriting” changes which were about to occur but did not get recorded as complete. You lose whatever changes were not yet flushed to disk prior to the failure, but overall the filesystem is not corrupt.

The logs do not show anything to say why there was a failure. For that you probably need to monitor it with serial console and keep logging. The log goes to the second computer, and does not require much of the system to survive to log. Can you start a serial console log to another computer and let it run until it fails? This should catch information which the log of the next boot would not normally catch.
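A minimal sketch of such a capture, assuming the Jetson’s UART shows up on the logging machine as /dev/ttyUSB0 (the device path and the 115200 baud rate are assumptions typical of USB-serial adapters; adjust to yours):

```shell
# timestamp_lines: prefix every incoming line with a wall-clock timestamp,
# so the capture records exactly when each kernel message arrived.
timestamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%F %T')" "$line"
    done
}

# On the logging machine (device path and baud rate are assumptions):
#   stty -F /dev/ttyUSB0 115200 raw -echo
#   timestamp_lines < /dev/ttyUSB0 >> jetson-serial.log
```

Because the log accumulates on the second computer, the last lines before a crash survive even if the Jetson itself never writes anything to disk.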

Dear @linuxdev ,

thanks a lot for your reply.

Please find below an extract of kern.log showing what preceded the “faulty” boot.

Oct 11 04:42:18 jnano-desktop kernel: [   47.818963] Bridge firewalling registered
Oct 11 04:42:18 jnano-desktop kernel: [   48.502646] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Oct 11 04:42:19 jnano-desktop kernel: [   49.617482] Netfilter messages via NETLINK v0.30.
Oct 11 04:42:19 jnano-desktop kernel: [   49.623633] ctnetlink v0.93: registering with nfnetlink.
Oct 11 04:42:20 jnano-desktop kernel: [   50.419660] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
Oct 12 11:58:45 jnano-desktop kernel: [   56.670950] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
Oct 12 11:58:45 jnano-desktop kernel: [   56.670961] Bluetooth: BNEP socket layer initialized
Oct 12 11:58:47 jnano-desktop kernel: [   58.414492] fuse init (API version 7.26)
Oct 12 11:58:53 jnano-desktop kernel: [   64.043116] EXT4-fs (mmcblk0p1): warning: mounting fs with errors, running e2fsck is recommended
Oct 12 11:58:53 jnano-desktop kernel: [   64.064419] EXT4-fs (mmcblk0p1): recovery complete
Oct 12 11:58:53 jnano-desktop kernel: [   64.064429] EXT4-fs (mmcblk0p1): mounted filesystem with ordered data mode. Opts: (null)
Oct 12 11:59:20 jnano-desktop kernel: [   91.162734] nvgpu: 57000000.gpu             railgate_enable_store:297  [INFO]  railgate is disabled.
Oct 12 12:03:54 jnano-desktop kernel: [  365.678834] EXT4-fs (mmcblk0p1): error count since last fsck: 6
Oct 12 12:03:54 jnano-desktop kernel: [  365.684765] EXT4-fs (mmcblk0p1): initial error at time 1597765637: htree_dirblock_to_tree:991: inode 464593: block 1592264
Oct 12 12:03:54 jnano-desktop kernel: [  365.695816] EXT4-fs (mmcblk0p1): last error at time 1597767749: htree_dirblock_to_tree:991: inode 285072: block 1059628
Oct 13 13:40:43 jnano-desktop kernel: [92572.675832] EXT4-fs (mmcblk0p1): error count since last fsck: 6
Oct 13 13:40:43 jnano-desktop kernel: [92572.681762] EXT4-fs (mmcblk0p1): initial error at time 1597765637: htree_dirblock_to_tree:991: inode 464593: block 1592264
Oct 13 13:40:43 jnano-desktop kernel: [92572.692807] EXT4-fs (mmcblk0p1): last error at time 1597767749: htree_dirblock_to_tree:991: inode 285072: block 1059628
Oct 14 15:18:38 jnano-desktop kernel: [184845.205474] EXT4-fs (mmcblk0p1): error count since last fsck: 6
Oct 14 15:18:38 jnano-desktop kernel: [184845.211496] EXT4-fs (mmcblk0p1): initial error at time 1597765637: htree_dirblock_to_tree:991: inode 464593: block 1592264
Oct 14 15:18:38 jnano-desktop kernel: [184845.222639] EXT4-fs (mmcblk0p1): last error at time 1597767749: htree_dirblock_to_tree:991: inode 285072: block 1059628
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] Booting Linux on physical CPU 0x0
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] Linux version 4.9.140-tegra (buildbrain@mobile-u64-3456) (gcc version 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701] (Linaro GCC 7.3-2018.05) ) #1 SMP PREEMPT Thu Jun 25 21:25:44 PDT 2020
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] Boot CPU: AArch64 Processor [411fd071]
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] OF: fdt:memory scan node memory@80000000, reg size 48,
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] OF: fdt: - 80000000 ,  7ee00000
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] OF: fdt: - 100000000 ,  7f200000
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] Found tegra_fbmem: 00800000@92cb4000
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] earlycon: uart8250 at MMIO32 0x0000000070006000 (options '')
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] bootconsole [uart8250] enabled
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] OF: fdt:Reserved memory: failed to reserve memory for node 'fb0_carveout': base 0x0000000000000000, size 0 MiB
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] OF: fdt:Reserved memory: failed to reserve memory for node 'fb0_carveout': base 0x0000000000000000, size 0 MiB
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] OF: fdt:Reserved memory: failed to reserve memory for node 'fb1_carveout': base 0x0000000000000000, size 0 MiB
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] OF: fdt:Reserved memory: failed to reserve memory for node 'fb1_carveout': base 0x0000000000000000, size 0 MiB
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] OF: reserved mem: initialized node vpr-carveout, compatible id nvidia,vpr-carveout
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] OF: reserved mem: initialized node iram-carveout, compatible id nvidia,iram-carveout
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] OF: reserved mem: initialized node ramoops_carveout, compatible id nvidia,ramoops
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] cma: Reserved 64 MiB at 0x00000000fac00000
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000] On node 0 totalpages: 1039872
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000]   DMA zone: 8192 pages used for memmap
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000]   DMA zone: 0 pages reserved
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000]   DMA zone: 519168 pages, LIFO batch:31
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000]   Normal zone: 8136 pages used for memmap
Oct 14 22:51:26 jnano-desktop kernel: [    0.000000]   Normal zone: 520704 pages, LIFO batch:31

Do you have any hint from those EXT4-fs errors?

Thanks a lot!

I would suggest dumping the log from the serial console.

The newest log is from the worst case of shutting down improperly. The data which was being written at the time of error exceeded the size of the journal and there is no way to get a consistent filesystem. Automatic recovery is not possible, and the software is basically asking if you are brave enough to ok deleting random content (which can go beyond just the software which was being written at the time of power loss). You’ll have to verify, but from what I see you did not get to a regular login prompt (what I say below changes somewhat if you did get to a login).

You can attempt to recover, but it may never work right, and you might not know this until you’ve been using it for some time. There will be many parts which are not corrupt, and so if you have some content you are in need of, then it might still be there without corruption. If you were to simply use this without repair the errors are such that the corruption would propagate whenever you write new content, and so the system is refusing to continue without inconsistencies being fixed (fixing inconsistencies is where you lose content).

How important is this? Regardless, if you want to fix it, I recommend first cloning. If this is an SD card model, then this is simple, you can just clone it directly on your host PC, and then try to repair the SD card itself while keeping the clone for backup. If it is an eMMC model, then you have to use the flash software to clone, fix the clone, and then flash the repaired clone. Or just flash and be done with it, though this loses your current content.
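The clone-then-repair flow for an SD-card model could be sketched like this (the device node /dev/sdX and the image name are placeholders I chose, not from the thread; verify the real node with `lsblk` before running anything):

```shell
# clone_disk: raw-copy a block device (or file) to an image file,
# flushing the data at the end so the backup is complete on disk.
clone_disk() {
    dd if="$1" of="$2" bs=4M conv=fsync status=none
}

# Typical use on an SD-card Nano, from the host PC:
#   clone_disk /dev/sdX nano-backup.img   # keep this image as the backup
#   e2fsck -f /dev/sdX1                   # then repair the card's rootfs
```

Keeping the untouched image means a failed repair attempt on the card itself costs nothing.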

Thanks @linuxdev for your message, very kind.
You are right, the boot of Oct 14 22:51:26 was not successful. The board got stuck in a non-booting state.
However, that shutdown was not caused by me or any controlled factor.
As you can see, the previous boot happened on Oct 12 at 11:58:45; the system had been running for more than 2 days and then it rebooted, but I do not understand what happened on October 14th.

How did you manage to understand the situation from the log?

As an additional detail: I am currently booting the system from a USB drive. Do you think this might bring more clarity? In any case, I have not experienced any data loss so far, and I have kept the system on with Grafana and InfluxDB collecting data since last year.

(I have followed this article. The SD Card is still required for the boot.)

Today, I have also set up a serial console with a Raspberry Pi that I will keep active to log the data from the Jetson Nano.

I think I am having 2 issues here:

  1. Unwanted reboots
  2. Some of those unwanted reboots cause the system to get stuck on the next boot, due to the problem you were explaining.

Do you have any idea about the reason?

Meanwhile, as a crude countermeasure, I thought of setting up a crontab with a controlled reboot command every 24 h, in the hope that restoring the system to nominal conditions daily will mitigate the situation.
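For reference, that workaround could look like the sketch below (the 04:00 schedule is purely an assumption; pick a quiet window for your workload):

```shell
# build_reboot_entry: emit a crontab line that reboots the board daily
# at the given hour (minute 0).
build_reboot_entry() {
    printf '0 %s * * * /sbin/shutdown -r now\n' "$1"
}

# Installing it would look like this (left commented so nothing is
# modified when the sketch runs as-is):
#   ( crontab -l 2>/dev/null; build_reboot_entry 4 ) | crontab -
```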

Looking forward to hearing your opinion!

Thanks again!!!

  1. Unwanted reboots
  2. Some of those unwanted reboots cause the system to get stuck on the next boot, due to the problem you were explaining.
    Do you have any idea about the reason?

Generally, we need your log to tell what is going on. An unwanted reboot could be a kernel panic or an unstable power supply. If it is a kernel panic, then the serial console can tell; syslog cannot.

That is why I want to check the log.

The logs so far only say the filesystem is corrupt from improper shutdown. You would have had to have been recording on serial console at the time it shut down to get a clue as to why it happened. After it is reflashed, if you think it will do it again, then make sure you have a continuous serial console log running (preferably running “dmesg --follow”).

If you have valuable content you want to save before flashing, information can be provided (if it is an SD card model, then you can either clone the SD card or use a new one; with eMMC models you have to use the flash software to clone).

In the case of power, was this an NVIDIA-provided power source?

Crontab might be an ok way to work around it, but if it is a power issue, then it won’t save it forever. A log from serial console at the moment of a failure would say a lot.

Dear @linuxdev thanks again!
I am running a commercial 5V power supply, not the NVIDIA-provided one.
However, I have a Jetson B01 running since the beginning of 2021 with the very same power supply without having experienced any reboot. Both Jetsons are in the same configuration, booting from an SSD.
What changes is that the one experiencing the reboots is a Jetson A02.

Dear @WayneWWW, I am currently logging with the serial monitor.
I have quickly set up a Raspberry Pi with Node-RED and Grafana to log the serial input and trigger alerts when new lines are sent over the serial console. However, that system proved to be less stable than the Jetson :D
Let’s see if I am able to get any information from the unwanted reboots.

Thanks!

Dear @linuxdev and @WayneWWW ,

I have captured a crash through serial console.
It seems to be related to disk access(?).

Please find the log attached to the message.
Looking forward to hearing your comments, and thank you very much again.
serial_log.txt (3.1 MB)

I don’t know if it is from previous improper shutdowns, but the filesystem is giving an enormous amount of error messages. There is no way to fix this, you will have to flash this one with a clean install. It is possible that the hardware itself is bad in some way which is causing this, but there is no way to know without trying again with a known clean filesystem. Can you flash with a new install and keep logging like this last log in case it fails? The log was very useful, and if there is a hardware problem (versus software from crashes or power loss corrupting the filesystem over time), then these errors will probably continue to happen at the same rate.

I want to expand on this by saying the errors are generally on an NVMe, not the eMMC. In the case of an NVMe hardware error, this can be the NVMe itself, or if a cable is used, then the cable. If it is mounted directly, e.g., via an M.2 card, then it might be the connection not being quite right (in which case unseating and reseating the card prior to flashing might help). The fact is, though, that it seems to fail for entire blocks, and not just random parts of blocks, which looks more like hardware than software. If this fails again, then it would be useful to put the NVMe on another computer, format it there, and stress test it on the other computer. This would verify whether the hardware is bad or not, but you’d want to monitor “dmesg --follow” on the host PC while stress testing (on the Jetson this appears to be the rootfs, but on the PC it would just be a non-critical piece of hardware, and thus less likely to crash).
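A simple write/read stress cycle for that test, run while watching “dmesg --follow” in another terminal, might look like this (the mount point in the usage comment is an assumption):

```shell
# stress_cycle: write N MiB of random data with an fsync, then read it
# back; repeated in a loop this keeps the drive busy so latent I/O
# errors surface in `dmesg --follow`.
stress_cycle() {  # $1 = file path on the drive under test, $2 = size in MiB
    dd if=/dev/urandom of="$1" bs=1M count="$2" conv=fsync status=none
    dd if="$1" of=/dev/null bs=1M status=none
}

# On the suspect drive, mounted e.g. at /mnt/testdisk (an assumption):
#   for i in $(seq 1 100); do stress_cycle /mnt/testdisk/stress.bin 256; done
```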

Note: I can show some basic testing on the PC which would be done by making a dd backup. I sometimes use dd with failing drives for data recovery, and dd has options for backing up from a failing drive; leaving out those options, and having the read or write fail, is a good indication of failing hardware.
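A sketch of the two dd reads described above, demonstrated here on a scratch file; on real hardware `if=` would point at the failing device (e.g. /dev/sda, an assumption to verify with `lsblk`):

```shell
# Scratch stand-in for the suspect drive (4 x 64 KiB of random data):
dd if=/dev/urandom of=suspect.img bs=64K count=4 status=none

# Strict read: any I/O error aborts dd. On a real drive, an abort here
# is itself a strong hint of failing hardware.
dd if=suspect.img of=strict.img bs=64K status=none

# Tolerant read: `noerror` keeps going past read errors and `sync`
# zero-pads the bad blocks so the image stays block-aligned -- the
# classic options for imaging a failing drive.
dd if=suspect.img of=recovered.img bs=64K conv=noerror,sync status=none
```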

Thanks again for the message!

I discussed unwanted reboots with @WayneWWW back in 2020 as well.

At the time I solved it by activating jetson_clocks, which disables Dynamic Voltage and Frequency Scaling (DVFS).
After that fix there were no more reboot problems in 2020, and the hardware is exactly the same. Here I have the A02 model running with boot from an SSD over USB 3.0.

Your hypothesis is that something has degraded over time and is now susceptible to errors (e.g. the cable or the SSD).
What does not convince me is that if I reset the system once per day, I experience no reboots.
In this last case the bad reboot happened after 38 hours.

I have recorded a new type of error by stressing the Jetson Nano with an inference pipeline.
The error is the following:

[79500.960911] EXT4-fs warning (device sda1): dx_probe:743: inode #1048577: lblock 0: comm gmain: error -5 reading directory block
[79501.260581] usb 2-1.4: Device not responding to setup address.
[79502.472535] usb 2-1.4: Device not responding to setup address.
[79502.684320] usb 2-1.4: device not accepting address 11, error -71
[79502.691272] usb 2-1-port4: 

In my opinion the SSD I am using is of good quality. It’s a standard Kingston SSD.
Do you think it is possible that the USB3.0 to SATA adapter I bought on Amazon might be the problem?

Thanks

Rebooting once per day could mean not running at as high of a temperature, which in turn could still be hardware (but the hardware might temporarily degrade over time when at temperature…does running at a high disk load increase reboot rate?). And of course this could also be software.

The “Device not responding to setup address” is indeed hardware, but it is USB. This does not mean it is actually USB failing for this particular case, but it could be. It could also be the NVMe since setup address requires participation of the NVMe (but it is USB hardware and software so odds shift towards USB). Can you try the NVMe on a different USB port? If you run the command “lsusb -t” before changing port you’ll find a “root_hub”; the best test is if you move to a port which is a different “root_hub” (though this might not be possible in all cases). However, it is still an NVMe issue regardless of which “wire” is at fault (or if it is the NVMe failing to “talk” to that wire). It is definitely a block device failure though.

A USB3.0 SATA adapter is a good candidate for being at fault in this particular case. This is the “middleman” when setting up the address, but there is no guarantee this is bad. Perhaps this too only fails when it is at temperature and has been running for days. It is hard to tell without something like a logic analyzer on the USB at the time.

I think temperature is not a problem.
I am also recording the temperature of the board through influxdb.

The peaks on the right side are relative to when I have activated my inference application on the jetson nano.
I think this temperature level is perfectly normal for Jetson Nano.

The output of lsusb -t is the following:

/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=tegra-xusb/4p, 5000M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 5000M
        |__ Port 4: Dev 3, If 0, Class=Mass Storage, Driver=usb-storage, 5000M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=tegra-xusb/5p, 480M
    |__ Port 2: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
        |__ Port 2: Dev 3, If 0, Class=Vendor Specific Class, Driver=cp210x, 12M

I do not really understand how they are connected.
I recognize the following port because I have attached an ESP32 device just for the purpose of power supply.
|__ Port 2: Dev 3, If 0, Class=Vendor Specific Class, Driver=cp210x, 12M

I will try replacing the USB3.0 SATA adapter as the first change, and I will let you know.
Thanks again!!!

Is that temperature graph the temperature of the NVMe? Or of the USB adapter between the NVMe and the Jetson? Those are the temperatures in question.

Looks like the NVMe is on USB3, and there is only one USB3 root_hub, so you can change ports (which is still a good test), but the same root_hub will be reused each time (so you can’t expect that to change). Do try a different port if you have one which will work with USB3. Perhaps switching to USB2 (which would simply mean using a port not set up for USB3; the NVMe would revert to USB2 and migrate to the other root_hub) would slow things down, but might still be a good test (if the system still reboots with that error when in USB2 mode, then you know there is a serious error, although a lack of reboots wouldn’t be definitive of what is wrong).

The ESP32 won’t have anything to do with this particular error.

Replacing the USB3 SATA adapter is a good test as well.

Hi @linuxdev , yes, you are perfectly right about the temperature.
The one I am collecting is the temperature of the GPU, and it gives me a “stress index” while inference is running.

I have performed the first test.
The system lasted 4 days.
Then I got an error and part of the functionality stopped (e.g. Grafana), but no reboot was triggered and I was still able to access the board via SSH.

I have reported the serial log here below:

[01:04:06.833256 0.000010] 1803] EXT9mmcb(p1rroece last fs6
[0921] EXT4-fs (m0p1): at time 159776rblock_to_tree:de 46493: block 15920rror at time 1597767749: htree_dir1: inode 285072:[165.125 lkudtreuest: I/da, sector 537550.716667] EXTdevice sda1):  writing to inode 11406765 (offse72032 startin380)
[01:32:32.401259 1705.625858] [or on device sda1, logical block 6736638] Buffer I/ce sda1, logical 12922] Buffer I/O error logical block 6719201] Buffer I/O  sda1, logical bl6550.755477] Bufn device sda1, l361798] Buffer I/O error on devicelock 6713086
[01:32:32.660098 0.268985] [1ffer I/O error onogical block 6713360] Buffer I/O eda1, logical bloc550.780652] Buffecal block 6713089
[01:32:32.792895 0.132802] [186550.786938ror on device sda 6713090
[01:32:32.837103 0.044207] [186550etected IO errors file data on sda-8
[01:32:38.153988 5.316575] 16556.694059] bl tor 232056376
[01:32:38.212091 0.058412] [] Aborting journ1-8.
[01:32:50.154193 11.942091] [856897_update_request: I/O error, dev sda, k_update_request: sda, sector 2048] Buffer I/O erroogical block 0, lrite
[01:32:50.475112 0.320932] [186568.71r): ext4_journal_check_start:56: Deournal
[01:32:50.596270 0.121157] [186568. s I/O error to sued
[01:32:50.693953 0.097683] [186568.7173equest: I/O erroror 230950912
[01:32:50.779445 0.085492] [18k_update_request:I/O error, dev 950912
[01:32:5717352] Buffer I/da1, logical bloc sync page write8] JBD2: Erro-5 detected whnal superblock f568.760881] EXT4tem read-only
[01:33:02.154287 11.781667] [15.956 lk_updaerror, dev sda, 86580.703609] blk sda, sector 2048
[01:33:02.272186 0.117900] [186580.709608]r, logical blockpage write
[01:33:02.329435 0.057247] [1EXT4-fs error (ext4_journal_chDetected aborte580.717012] EXT4-ous I/O error tocted
[01:33:02.445111 0.115679] [186580.7(sda1): ext4_writrt: 13312 pages, r -30
taino 11406765; er

(I do not understand why the tool I am using, grabserial (github.com/tbird20d/grabserial), is saving the serial log with mistakes. Is it possible to reduce the baud rate of the serial console on the Jetson Nano?)

(I forgot to mention that the first test consisted of replacing the USB3-SATA adapter…)

Thanks again!

Regarding errors, there is a good possibility that the timing of data on the UART is inconsistent due to errors and/or load. I doubt reducing baud rate would help. Seeing this with unreliable output is in itself an interesting debug tool. Normally it occurs when there is noise present, e.g., wires too long or introduced from noise in the power supply. I think in this case it is due to timing being off while recovering from errors.

The I/O errors would be either the USB3 adapter, or else the disk itself. If you’ve swapped with another USB3 adapter, and errors don’t change, then it is looking like the disk is failing. Do you have another hard drive you could substitute as a test while using that USB3 adapter?

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.