Jetson Nano crashes after 3 to 10 days of operations - continued

• Hardware Platform (Jetson / GPU) Jetson Nano
• DeepStream Version 5.0
• JetPack Version (valid for Jetson only) 4.4-b144
• TensorRT Version 7.1.3
• Issue Type( questions, new requirements, bugs) question

Dear All,

I have been looking into this problem for some time now.
Past reference here:
https://forums.developer.nvidia.com/t/jetson-nano-crashes-after-3-to-10-days-of-operations/230912/1

Problem: after a certain number of hours with the system running, the Jetson Nano just hangs without any sign of life.
As suggested by @linuxdev, I have set up a serial console that continuously logs the kernel log on a separate device.
I have been logging the behavior of the Jetson Nano for weeks now, and lately I got this message:

[20:49:54.662037 0.187796] [13696.763832] blk_update_request: I/O error, dev sda, sector 388862096
[20:49:54.729131 0.067092] [13696.770270] EXT4-fs warning (device sda1): ext4_end_bio:313: I/O error -5 writing to inode 11409300 (offset 9515008 size 192512 starting block 48607810)
[20:49:54.839780 0.110651] [13696.783973] Buffer I/O error on device sda1, logical block 48607506
[20:49:54.927229 0.087445] [13696.790311] Buffer I/O error on device sda1, logical block 48607507
[20:49:54.991904 0.064680] [13696.797228] Buffer I/O error on device sda1, logical block 48607508
[20:49:55.036680 0.044775] [13696.803603] Buffer I/O error on device sda1, logical block 48607509
[20:49:55.081015 0.044336] [13696.809897] Buffer I/O error on device sda1, logical block 48607510
[20:49:55.136984 0.055968] [13696.816218] Buffer I/O error on device sda1, logical block 48607511
[20:49:55.181459 0.044474] [13696.822530] Buffer I/O error on device sda1, logical block 48607512
[20:49:55.231958 0.050500] [13696.828865] Buffer I/O error on device sda1, logical block 48607513
[20:49:55.276356 0.044397] [13696.835165] Buffer I/O error on device sda1, logical block 48607514
[20:49:55.320482 0.044128] [13696.841461] Buffer I/O error on device sda1, logical block 48607515
[20:49:55.367289 0.046805] [13696.847907] sd 0:0:0:0: rejecting I/O to offline device
[20:49:55.413999 0.046712] [13696.853190] blk_update_request: I/O error, dev sda, sector 231106960
[20:49:55.458451 0.044451] [13696.859551] Aborting journal on device sda1-8.
[20:49:55.489571 0.031121] [13696.859816] sd 0:0:0:0: rejecting I/O to offline device
[20:49:55.526986 0.037414] [13696.859851] sd 0:0:0:0: rejecting I/O to offline device
[20:49:55.563795 0.036810] [13696.859876] sd 0:0:0:0: rejecting I/O to offline device
[20:49:55.600220 0.036424] [13696.879723] sd 0:0:0:0: rejecting I/O to offline device
[20:49:55.645770 0.045551] [13696.884970] JBD2: Error -5 detected when updating journal superblock for sda1-8.
[20:49:55.698504 0.052733] [13696.892994] sd 0:0:0:0: rejecting I/O to offline device
[20:49:55.737519 0.039015] [13696.898288] sd 0:0:0:0: rejecting I/O to offline device
[20:49:55.774172 0.036653] [13696.905886] sd 0:0:0:0: rejecting I/O to offline device
[20:49:55.810654 0.036482] [13696.911160] sd 0:0:0:0: rejecting I/O to offline device
[20:49:55.847562 0.036904] [13696.916525] sd 0:0:0:0: rejecting I/O to offline device
[20:49:55.890523 0.042965] [13696.921814] sd 0:0:0:0: rejecting I/O to offline device
[20:49:55.927643 0.037120] [13696.927210] sd 0:0:0:0: rejecting I/O to offline device
[20:49:55.967215 0.039572] [13696.933031] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.004190 0.036976] [13696.961045] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.041395 0.037203] [13696.966364] EXT4-fs error (device sda1): ext4_find_entry:1441: inode #11406967: comm systemd: reading directory lblock 0
[20:49:56.118189 0.076796] [13696.977263] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.155175 0.036987] [13696.991878] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.191654 0.036477] [13697.151752] EXT4-fs (sda1): previous I/O error to superblock detected
[20:49:56.238201 0.046548] [13697.152823] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.275002 0.036801] [13697.163532] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.311506 0.036504] [13697.168776] EXT4-fs error (device sda1): ext4_journal_check_start:56: Detected aborted journal
[20:49:56.378052 0.066544] [13697.177393] EXT4-fs (sda1): Remounting filesystem read-only
[20:49:56.417703 0.039652] [13697.183021] EXT4-fs (sda1): previous I/O error to superblock detected
[20:49:56.463090 0.045388] [13697.189481] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.500068 0.036977] [13697.197253] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.537074 0.037007] [13697.202560] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.580377 0.043302] [13697.210772] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.617541 0.037164] [13697.216694] EXT4-fs error (device sda1): ext4_find_entry:1441: inode #661941: comm (start.sh): reading directory lblock 0
[20:49:56.695350 0.077809] [13697.228164] EXT4-fs (sda1): previous I/O error to superblock detected
[20:49:56.741109 0.045759] [13697.234990] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.777717 0.036609] [13697.401042] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.814610 0.036892] [13697.406341] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.851842 0.037231] [13697.650987] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.891097 0.039256] [13697.656293] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.928077 0.036977] [13697.900992] sd 0:0:0:0: rejecting I/O to offline device
[20:49:56.964913 0.036840] [13697.906292] sd 0:0:0:0: rejecting I/O to offline device
[20:49:57.001444 0.036531] [13698.151151] sd 0:0:0:0: rejecting I/O to offline device
[20:52:01.595446 124.593999] [13823.874271] sd 0:0:0:0: rejecting I/O to offline device
[20:52:01.664587 0.069142] [13823.879596] sd 0:0:0:0: rejecting I/O to offline device
[20:52:01.708125 0.043539] [13823.885126] sd 0:0:0:0: rejecting I/O to offline device
[20:52:01.744864 0.036740] [13823.890424] sd 0:0:0:0: rejecting I/O to offline device
[20:52:02.362951 0.618081] [13824.641577] sd 0:0:0:0: rejecting I/O to offline device
[20:52:02.459365 0.096351] [13824.646834] sd 0:0:0:0: rejecting I/O to offline device
[20:52:02.543726 0.084370] [13824.652107] sd 0:0:0:0: rejecting I/O to offline device
[20:52:02.653080 0.109350] [13824.657355] sd 0:0:0:0: rejecting I/O to offline device
[20:52:02.774847 0.121774] [13824.662606] sd 0:0:0:0: rejecting I/O to offline device
[20:52:02.831484 0.056693] [13824.667846] sd 0:0:0:0: rejecting I/O to offline device
[20:52:02.872463 0.040978] [13824.673206] sd 0:0:0:0: rejecting I/O to offline device
[20:52:02.924369 0.051908] [13824.678507] sd 0:0:0:0: rejecting I/O to offline device
[20:52:02.960816 0.036447] [13824.683763] sd 0:0:0:0: rejecting I/O to offline device
[20:52:02.997673 0.036856] [13824.689007] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.034524 0.036851] [13824.698270] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.072715 0.038191] [13824.703880] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.109458 0.036744] [13824.713690] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.146098 0.036639] [13824.722746] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.182899 0.036801] [13824.722803] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.220501 0.037602] [13824.729890] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.257132 0.036631] [13824.731424] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.294017 0.036886] [13824.740910] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.330460 0.036443] [13824.744429] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.370139 0.039679] [13824.744463] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.406781 0.036642] [13824.744473] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.444479 0.037697] [13824.744488] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.481417 0.036938] [13824.744712] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.518415 0.036999] [13824.744730] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.555077 0.036661] [13824.744803] sd 0:0:0:0: rejecting I/O to offline device
[20:52:03.591739 0.036662] [13824.745036] sd 0:0:0:0: rejecting I/O to offline device

I am currently booting the system from a USB3 SSD drive as explained here:

Do you think a disk change would solve the problem?
(I have already tried changing the SATA-to-USB3 adapter, apparently without solving the issue.)

Thanks a lot!

“/dev/sda” seems to be failing. It isn’t any specific memory location: as you can see, it stops working and then the errors increment through logical blocks one after another. You said this is USB, but I don’t see USB errors listed, so it points to the drive itself. However, do those errors go away completely for some time if the system cools down? This might be heat related, especially if the drive is solid state and not old school mechanical.
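
The "one block after another" observation can be checked mechanically against a saved copy of the serial console log. This is only a minimal sketch: the function name check_blocks and the file name serial_console.log are hypothetical, and it assumes the log lines have the same "Buffer I/O error on device ..., logical block N" form as the ones quoted in the question.

```shell
# Hypothetical helper: report whether the logical blocks named in
# "Buffer I/O error" lines of a saved kernel log are consecutive.
check_blocks() {
    # Keep only the block number from each matching line.
    sed -n 's|.*Buffer I/O error on device [^,]*, logical block \([0-9][0-9]*\).*|\1|p' "$1" |
    awk 'NR > 1 && $1 != prev + 1 { gap = 1 }   # a jump means non-adjacent blocks
         { prev = $1 }
         END {
             if (NR == 0)  print "no block errors"
             else if (gap) print "gaps"
             else          print "consecutive"
         }'
}

# Usage (file name is an assumption):
# check_blocks serial_console.log
```

For the ten "Buffer I/O error" lines quoted in the question (blocks 48607506 through 48607515) this prints "consecutive", which fits a single large write failing as one unit rather than scattered bad spots.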

One thing to worry about is that the first time a write fails like this, especially since the log mentions that the journal is involved, all of the content could be corrupted. Hopefully the system rejected the device and recognized the problem before anything was actually written to disk.

Do non-write operations still work? For example, the “uname -r” command? I’m hoping that the “smartctl” command works, but it has to be read from the disk. If reads work, then the disk can be queried for temperature and other errors. “smartctl” will be useful even before the error, though. If you don’t have it, and assuming the system works, run “sudo apt-get install smartmontools”. Then, early in boot before there are such errors (verify this isn’t happening right after boot by checking for errors in dmesg), do the following in the serial console or in ssh, since you don’t want the session itself to depend on this disk:

sudo smartctl -a /dev/sda | tee log_smartctl.txt
dmesg | egrep '(error on device|offline device|journal|I.O error|sda)' | tee log_dmesg.txt

This will create a log of each command's output. If you run the same command again with “tee” like that, it will overwrite the earlier log. For a second run, increment a number in the log file name, e.g.:

sudo smartctl -a /dev/sda | tee log_smartctl_2.txt
dmesg | egrep '(error on device|offline device|journal|I.O error|sda)' | tee log_dmesg_2.txt
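
To avoid bumping the number by hand on every run, the file name can be generated. The following is a sketch only: next_log_name is a hypothetical helper, not part of smartmontools, and it simply picks the first numbered name that does not exist yet in the current directory.

```shell
# Hypothetical helper: print the first "<prefix>_<n>.txt" that does not exist yet,
# so a repeated run never overwrites an earlier log.
next_log_name() {
    n=1
    while [ -e "${1}_${n}.txt" ]; do
        n=$((n + 1))
    done
    printf '%s_%s.txt\n' "$1" "$n"
}

# Usage (device path /dev/sda is taken from the kernel log in the question):
# sudo smartctl -a /dev/sda | tee "$(next_log_name log_smartctl)"
# dmesg | egrep '(error on device|offline device|journal|I.O error|sda)' | tee "$(next_log_name log_dmesg)"
```

Note that once the root filesystem has been remounted read-only, writing these logs to it will fail, so in the error state it is safer to capture the output on the machine at the other end of the serial console or ssh session.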

The goal is to get a log when the drive is cooled down and not in error, followed by a log immediately after you first notice the error. “S.M.A.R.T.” capable drives (and most drives now are capable of this) have some internal memory for recording error conditions and keep a history as a means of detecting drive failure. USB issues probably wouldn’t be recorded, but issues internal to the drive itself will likely show up. You could start by posting the “good and still works” log, and then post again later when you get it to show up on a serial console or ssh.
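
Once both logs exist, the lines that changed between the "good" run and the "error" run can be listed with diff. This is a rough sketch: smart_changes is a hypothetical name, and it compares the two files line by line without understanding the smartctl report format.

```shell
# Hypothetical helper: show only the lines that differ between two saved logs.
# Lines starting with "<" come from the first file, ">" from the second.
smart_changes() {
    # diff exits non-zero when the files differ, so that is not treated as a failure.
    diff "$1" "$2" | grep '^[<>]' || true
}

# Usage (file names as in the commands above):
# smart_changes log_smartctl.txt log_smartctl_2.txt
```

Changed lines such as temperature or error counters are the ones worth posting alongside the full logs.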
