Random files appear in the home folder (and bash history corrupted)

cenit · July 11, 2018, 6:17am

We have five Jetson TX2 boards deployed in the field for some CV tasks. Each one has an IP camera connected directly to it and a Huawei modem on a USB port to send us results.
So far we have no way to determine the root cause (which is why I am asking for help), but random files keep appearing in the user's home folder. It can happen once per day and then disappear for a week. The filenames are unreadable, a variable-length stream of characters like this: ??u?M?M%?[???I???]b??ݜ??^???ک?????F???Q???#, just to give you an idea. From time to time the bash history also gets corrupted with similar streams of characters. Less frequently, we experience a hang or even a shutdown without any apparent reason. Since the devices are not in a laboratory but in the open field, the only easy way for us to restart one is to go to its location, remove the power, and restore it (not very efficient, particularly since we want to expand our installation base).
Of course temperatures are under control and the setup should be ok.
What is the best approach to understand what’s going on? We don’t have a lot of hardware or low-level embedded experience, but of course we should be able to implement all the good suggestions :)
Oh, before I forget: of course it could be our own software causing this problem (the only external libraries we use are cuda/cudnn and opencv), but it is not easy to see how: we don’t write any files to the home folder, we have never been able to reproduce this behaviour on any other architecture, and we have not found any way to trigger it deterministically on the TX2…
Thanks!

linuxdev · July 11, 2018, 9:50am

Cutting power (if it is still running) instead of a proper shutdown can make corruption issues worse. On the other hand, if you use “ls -l whatever_file_it_is”, then you’ll see a timestamp. Run dmesg and see if there is anything going on very close to that timestamp. Also, run “df -H -t ext4” and see if something is filling up.
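
A minimal sketch of the timestamp check described above, assuming GNU find (as shipped with Ubuntu). The function name list_stray_files is made up for illustration. It lists zero-byte files directly under a directory whose names contain bytes outside printable ASCII, together with each file's modification time and inode number:

```shell
# Hypothetical helper: list empty files with non-printable names.
# Usage: list_stray_files [directory]   (defaults to $HOME)
list_stray_files() {
    dir="${1:-$HOME}"
    # LC_ALL=C makes the pattern work byte by byte, so invalid UTF-8
    # sequences in the names cannot break the match.
    # '*[! -~]*' matches any name containing a byte outside the
    # printable ASCII range (space through tilde).
    LC_ALL=C find "$dir" -maxdepth 1 -type f -size 0 \
        -name '*[! -~]*' \
        -printf '%TY-%Tm-%Td %TH:%TM inode=%i\n'
}
```

The printed time can then be used to narrow the kernel log, e.g. journalctl -k --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM" (earlier boots are only reachable with a persistent journal). The inode number is handy because the names cannot be typed: find . -maxdepth 1 -inum <inode> -delete removes one such file.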

cenit · July 11, 2018, 10:18am

Unfortunately we don’t have any option other than a power cut. The devices do not respond via ssh when they hang, so that’s all we can do. Our workload is not very I/O intensive, so I don’t think this corruption is really related to cutting power.
Also, files appear at random times, and luckily hangs or shutdown are MUCH much less frequent than the appearance of these files (with zero size, so no worry about occupancy!)…

I will try to analyse dmesg for correlations, thanks for the suggestion. For now I am stuck, because we had a planned reboot and now the command

journalctl -k -b -1

gives me this output:
Specifying boot ID has no effect, no persistent journal was found

First I need to enable persistent journals :)
edit: did it with
sudo mkdir -p /var/log/journal
sudo systemd-tmpfiles --create --prefix /var/log/journal
sudo systemctl restart systemd-journald

linuxdev · July 11, 2018, 7:25pm

If you are physically there, then I’d suggest using a serial console. This would also give you information about what is going on prior to cutting power for reset…serial console will usually work even when a large part of the system is failing (you might even be able to see dmesg which has not yet made its way to disk). See:
[url]http://www.jetsonhacks.com/2017/03/24/serial-console-nvidia-jetson-tx2/[/url]

In your case this would be more important since it would possibly give you information prior to cycling power. And if serial console works, then you could also run “sudo shutdown -r now” to reboot.

If anything was wrong with the file system, then each time there is a journal replay you will lose part of your files if they are not yet written. In extreme conditions the journal may not even be enough to keep the file system “consistent”. A “consistent” file system is one where the structure is valid (for example you don’t have two files each mistakenly believing an inode belongs to them…and thus overwriting each file’s change). The fact that you see nothing from logs via the journalctl does not surprise me since these would not have been written/flushed until shutdown…any disk failure would prevent the write…and even if logs were partially written, then booting back up and having ext4 journals play would cause the loss of anything not validated.

If you cannot use a serial console, and if this is important enough, then you might want to clone the rootfs after a failure (prior to booting again…the journal would wipe out some information), and then explore the clone on a host PC. Sadly, there is no way to get around a clone taking a long time to complete (perhaps a couple of hours?).

One thing I would want to know, if you are able to check it near the point of failure (e.g., just before, or even after), is how much memory is used, both RAM and disk. “free -m” would show RAM usage in MB. “df -H -t ext4” would show the free space on any ext4 partition in human-readable units (e.g., GB).
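
Since nobody is on site when a failure happens, one option is to record these numbers periodically so they are available afterwards. The following is only a sketch: the function name log_resources and the log format are made up, and it reads /proc/meminfo and "df -P" instead of "free -m" and "df -H" simply because their output is easier to parse.

```shell
# Hypothetical helper: append one timestamped resource snapshot to a log file.
# Usage: log_resources /path/to/logfile
log_resources() {
    logfile="$1"
    # Memory the kernel estimates is available to new workloads, in kB.
    mem_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    # Percentage of the root filesystem in use (POSIX output format).
    disk_pct=$(df -P / | awk 'NR==2 {print $5}')
    printf '%s mem_available_kB=%s root_used=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$mem_kb" "$disk_pct" >> "$logfile"
}
```

Called once a minute (from cron, for example), this leaves a trail that survives a hang as long as the last lines reached the disk. Writing the log to a dedicated directory also keeps it apart from the stray files.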

Will you be able to use a serial console?

cenit · July 14, 2018, 8:10am

We are not physically there, unfortunately. I am trying to set up another device in the lab, hoping to reproduce what’s going on in the field and also to have a serial console as you described.
But this is what happened very recently on one device. Note that it did its job brilliantly during the whole week (and is still doing so, fingers crossed) without any noticeable problem (without logging in I would not have seen anything wrong!), and that we reboot it programmatically each night at 4 am (we tried this to see if it helped, but it seems to make no difference).
Anyway, when I logged in I found this:

nvidia@jetson2:~$ ll
total 272
drwxr-xr-x 25 nvidia nvidia   4096 Jul 14 04:00 ./
drwxr-xr-x  4 root   root     4096 Jan  6  2017 ../
-rw-rw-r--  1 nvidia nvidia      0 Jul 13 13:39 8?I??ȏ?F??먧?U宅G??Y??ش+T??
-rw-------  1 nvidia nvidia   8422 Jul 14 04:00 .bash_history
-rw-r--r--  1 nvidia nvidia    220 Jun 24  2016 .bash_logout
-rw-r--r--  1 nvidia nvidia   3771 Jun 22 15:39 .bashrc
...
drwxr-xr-x  2 nvidia nvidia   4096 May  6  2016 Music/
-rw-rw-r--  1 nvidia nvidia      0 Jul 12 11:30 :?????m~???v??Un??
drwxrwxr-x  2 nvidia nvidia   4096 May 11 10:11 .nano/
...

and this is the last part of the history (cropped just to give you an idea of the mess):

84  ?ˇ? ?u_횚ZCrֲٖ߭?????@???%m?ǂ??S?G?????j???? ?O????R?F??\/??
   85  ???/?????K?־??zI??cZ??s?g?2Ry??p??z??z?_
   86  ??5?x9??VoH??\*;?l???s??????G?Vu????????e~??5??Q??t(?7Ơ????W??8Q??x6"?2??ML͙Vx???W?>?9??c?
   87  ???'?i%?-??W?ť??b???7=?????rvY?K??\????,?F???wY?͙|?f_?a;????gV'U??&?{?Fq"?G5S?V??t?V????
   88  ???Tm????uM?YM????pzʻ??M?\?A?]?YM????F??\?bj?r??I??ut?J???v?q9????3?????;??}e??????o0??h?2i??q???̑??Q???=??ƨ?8?h??&9??2F????7????{?J???w????^?޼?Z=??5?䙑h?????fXtme??O????]U
   89  ?2?jrO]]???ͷ?????:.q?4-???gH??[??$??,O=?Ś?o)??災???a7NŢ?o??>???f?r0?'\??KgR??q?_ ?r?????T????U??C???xE?;q?^??Վ[??3??VG??
   90  ?2?jrO]]???ͷ?????:.q?4-???gH??[??$??,O=?Ś?o)??災???a7NŢ?o??>???f?r0?'\??KgR??q?_ ?r?????T????U??C???xE?;q?^??Վ[??3??VG???ӻ???ePO3?ϭ?'y????@~?1????#O
   91  k
   92  k???f????~N%?n?????q?oz?=8?4?}?f-l?????????
   93  #?????A??ڡ^?????
   94  a?^(F?h?????s?>??օʕ?<ߚ?t\k?K9??+?Z??????~?Ȝ5'
   95  Ë???????????%OT~Z?rҩ?2???U??:r????n?"~?x??????O?????F???m??
   96  ?8??9I??????????^?????y?a???ڥ?t??????f?h?i?h[?8k?
   97  K??1??
   98  ll
   99  journalctl -k -b -1
  100  ll
  101  history
nvidia@jetson2:~$

No other files apparently have been damaged (but I am starting to think that the history is not damaged, it is just recording ‘real’ messy commands made by ??), and the device, as I was saying, is working perfectly without any power cut or anything. This is the most common situation, luckily, but still we cannot understand it while we would like to have the system fully under control.

Since I enabled the persistent journals, finally I was able to do
journalctl -k -b -2
and
journalctl -k -b -1
since the two files were created yesterday and two days ago.

I found very few messages that could be related. One is at exactly the same time as a file appeared, the other is not. Each excerpt is cropped between the final boot message and the next message after the SysRq line: in the first case that is a USB problem with the USB modem stick, and in the second it is the shutdown message of the following day (so, according to the log, everything went perfectly fine…?)

Jul 12 04:00:44 jetson2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
Jul 12 11:30:04 jetson2 kernel: sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show
Jul 13 03:44:19 jetson2 kernel: usb usb1-port2: disabled by hub (EMI?), re-enabling...

Jul 13 04:00:53 jetson2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
Jul 13 17:32:29 jetson2 kernel: sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-a
Jul 14 04:00:01 jetson2 kernel: tegradc 15210000.nvdisplay: blank - powerdown

I don’t like these messages about SysRq; it looks like something triggered a malformed SysRq…???
What can I log to better diagnose what made this mess (which luckily was non-destructive)?

linuxdev · July 14, 2018, 8:23pm

Before we look at the technical issues there is one question which needs to be asked: Since the default password is known as a standard Ubuntu thing all over the world, have you changed the password, and are you sure the script kiddies can’t just log in?

For reference, if I say “ALT-SYSRQ-something”, I mean using the ALT key and “SYSRQ” key, then tapping another key (sort of like “CTRL-ALT-del”, but with “ALT-SYSRQ-someOtherKey”). The SYSRQ key is the same key as “PrtScn”…PrtScn key is the “shift” of SysRq key. “ALT-SYSRQ-s” implies hold ALT key down, now hold down SYSRQ, then tap the “s” key (then let go of all keys). You can browse this online by searching “linux magic sysrq”. Wikipedia has a good section on it, but covers more than just Linux (e.g., Solaris, UNIX, FreeBSD). If you are developing on a system where it might go unresponsive learning about magic sysrq could be quite useful. Lower level parts of the kernel are usually still running even when much of the outside world is in trouble, and sysrq goes directly to those lower level inner workings.

You’ll normally see that sysrq message at certain times when connecting via serial console (depends on stray characters in the serial UART, e.g., if your serial console program thinks it is talking to a modem it might send a modem init string which a regular console doesn’t understand). That message is not necessarily a problem, but some keyboard sequence might have been detected on stdin which isn’t normally used in a regular console, and in turn triggers the display of the sysrq help message. My “feel” for this particular message is that message might imply a stray non-printing character has triggered it, but this is only a sign of something randomly echoing random bytes to stdin.

Just FYI, sysrq is a “good thing” if you are a developer. What it does is give access to certain functionality from a local console even if the system is in trouble. Certain key combinations, if enabled (and they are by default, but this can be configured through “/etc/sysctl.conf”), can trigger useful events for controlling an otherwise dead system. For example, on your Linux PC (or even a local Jetson), you could type “ALT-SYSRQ-s”, and if monitoring “dmesg --follow”, then you’d see the system did an emergency disk sync (not recommended on eMMC unless you really are about to do something which might corrupt the disk, e.g., shutdown by yanking the power cord…eMMC has wear leveling for longer life and constantly sync’ing will decrease eMMC life). If I get an unresponsive system I might use:

# sync
ALT-SYSRQ-s
# Remount disks read-only
ALT-SYSRQ-u
# Immediate reboot
ALT-SYSRQ-b

This is basically a mapping of certain key bindings which can trigger echo of the correct text to “/proc/sysrq-trigger”. You can perform the same commands by redirecting text to “/proc/sysrq-trigger” instead of using key bindings. The key and what is echoed is not always the same, but as an example you can do an emergency “sync” like this over serial console (serial console has no key bindings to sysrq so you must bypass the key binding detection…serial console is a sequence of bytes and has no key scan codes…try this on serial console to see how it works):

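# Note: "sudo echo 's' > /proc/sysrq-trigger" fails for a non-root user,
# because the redirect is opened by the calling shell and not by sudo.
echo s | sudo tee /proc/sysrq-trigger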

Trivia: This is the basis of using kgdb for software based kernel debugging over serial console with gdb running on a separate host PC.

If you echo something which isn’t valid in such a way that sysrq sees it, then it will print that help message. Each key on the keyboard has a scan code, and if random data happens to pass through a program’s command line I/O with the “ALT-SYSRQ-something” combination it could trigger that message. Think of it as being the same as triggering “--help” in many commands if you enter an argument to the command which the program does not understand…the program responds with its help message. You have stray data in your I/O which is mistaken for a keystroke to be sent to “/proc/sysrq-trigger”. It isn’t sysrq which is the problem, the problem is how this random data ended up in stdin/std::cin.
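
Regarding the configuration mentioned earlier: the current setting can be read from /proc/sys/kernel/sysrq. It is 0 (disabled), 1 (everything enabled), or a bitmask of allowed function groups. The sketch below decodes such a value. The function name describe_sysrq and the short labels are made up; the bit values follow the kernel's sysrq documentation as I remember it, so double check against the documentation for your kernel.

```shell
# Hypothetical helper: describe a kernel.sysrq value.
# Usage: describe_sysrq "$(cat /proc/sys/kernel/sysrq)"
describe_sysrq() {
    value="$1"
    case "$value" in
        0) echo "disabled"; return ;;
        1) echo "all functions enabled"; return ;;
    esac
    out=""
    # bit:label pairs for the function groups in the bitmask
    for pair in 2:loglevel 4:keyboard 8:debug-dumps 16:sync \
                32:remount-ro 64:signal 128:reboot 256:nice-rt; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( value & bit )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    echo "${out# }"
}
```

For example, describe_sysrq 176 prints "sync remount-ro reboot". Setting "kernel.sysrq = 0" in /etc/sysctl.conf would stop stray input from triggering these functions, but it only hides the symptom. It also does not restrict root writing to /proc/sysrq-trigger.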

This looks important:

kernel: usb usb1-port2: disabled by hub (EMI?), re-enabling...

EMI would be something to consider. EMI can trigger stray data…so can a brownout with momentary partial drop of power since RAM becomes unstable without actually crashing. I’m not sure what noise sources are out there, nor what kind of shielding or power isolation you might have (driving servo motors from the same source which runs the Jetson is a big mistake which can do this…no inductive or rapidly changing load should exist which directly touches the Jetson’s power source).

What is on that USB port it shows as being re-enumerated? What kind of wiring quality and shielding and isolation from the world would that USB device and HUB have? If this is a modem type device the moment of failure could result in something odd going on in the environment. If this device is being used by redirecting or piping data I could see this as causing random data to stdin. I could even see this as containing a “>” in the stream which causes it to create files randomly in the directory it currently runs in. Whatever your program is which runs, you might test by having it first “cd” to some empty location, e.g.:

mkdir ~/test_dir
cd ~/test_dir
# ...run your program as normal, see if you now get random files to this subdirectory...if so, then this program is at fault, but perhaps due to your USB pipe feeding it invalid data.

…basically it just says that if you can take everything you run as a custom program and get each component to run in a separate clean directory, then upon crash, if you get stray files, then you know the program which created them. The actual cause might be that program, or it might be the data fed to that program, e.g., if the program does not properly escape certain characters because you did not expect them to occur, but USB reenumeration causes “not so normal characters” to be fed to the program, then the expectations break.
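
A small wrapper makes this easy to apply to every component. This is only a sketch: the function name run_isolated, the RUN_ROOT variable and the ~/run default location are made up for illustration.

```shell
# Hypothetical helper: run a command inside its own working directory.
# Usage: run_isolated <name> <command> [args...]
run_isolated() {
    name="$1"
    shift
    workdir="${RUN_ROOT:-$HOME/run}/$name"
    mkdir -p "$workdir" || return 1
    # The subshell keeps the caller's current directory unchanged.
    ( cd "$workdir" && "$@" )
}
```

Started as, say, run_isolated rsync_job rsync ... and run_isolated cv_app ./your_app (names are placeholders), any stray file that later shows up under ~/run/rsync_job or ~/run/cv_app points at the component that created it. Files that still appear in the home folder would instead point at something not started through the wrapper, such as a login shell.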

FYI, if you run “lsusb”, and then “lsusb -t”, you can expect the tree view to have the “Bus” and “Dev” number matching the non-tree view. Identify what device was being re-enumerated.
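
As an alternative to matching lsusb output by eye, the device can also be looked up in sysfs. This sketch assumes that the kernel's "usb1-port2" means port 2 of the root hub on bus 1, and that the device plugged in there appears as the directory "1-2" under /sys/bus/usb/devices. The function name usb_port_info and the SYSFS_USB override (there only so the function can be tried against a fake tree) are made up.

```shell
# Hypothetical helper: print identity attributes of the device on a
# given root-hub port.
# Usage: usb_port_info <bus> <port>     e.g. usb_port_info 1 2
usb_port_info() {
    bus="$1"
    port="$2"
    dev="${SYSFS_USB:-/sys/bus/usb/devices}/$bus-$port"
    if [ ! -d "$dev" ]; then
        echo "no device at $bus-$port"
        return 1
    fi
    # Not every device exposes every attribute, so skip missing ones.
    for attr in idVendor idProduct manufacturer product; do
        if [ -r "$dev/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$dev/$attr")"
        fi
    done
    return 0
}
```

The idVendor and idProduct values printed here are the same pair that lsusb shows as "ID vvvv:pppp", so the two views can be cross-checked.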

cenit · July 14, 2018, 9:37pm

Dear @linuxdev,
thank you very much for taking a big chunk of your time to write such a long and detailed answer to me. Your reply was a delight to read!
First of all, let me say that a) we are NATed and behind an OpenVPN connection, and ssh is configured for key-only login, which should keep random script kiddies far away from the devices; b) I know some of the magic SysRq keys, and in fact I was scared that something was triggering them (I never disabled them, so the kernel was still listening for them), but thanks for the quick recap.
Sorry if the way I am explaining the problem is a bit confusing, but I agree with you that the problem is not the sysrq messages themselves, just the fact that something is triggering them, and this, as you say, comes down to understanding how random data can be entering stdin…
Thank you also for shedding more light on a problem I was almost neglecting, which is the USB re-enumeration.
In fact, I skipped the EMI part so quickly that I made a big mistake. To be honest, I am not even sure about that acronym… is it ElectroMagnetic Interference? Really??
In that case, let me explain our setup. We take the 220V from some shared sockets which are totally out of our control. It is just a power source, and we hope it’s good enough. This original AC 220V goes into an AC/DC power supply that outputs 12V DC 4A max, which are shared between a Jetson and an IP camera in all the installation points. This setup is replicated identically (1 power supply, 1 jetson and 1 camera) in multiple installations, and each one can manifest and had manifested problems like those without any previous notice (and most of the time, without destructive effects - no forced reboot necessary, just some shitty files in the home folder and in the history).
Those cameras have servos inside, as they are set up to have automatic focus, so servos can start by themselves to move lenses whenever they want…
And yes, the USB device is a 3g modem, which we use to periodically rsync data out of the box.
Good idea to run each executable from a different run folder, I will try this asap to cross-check if it’s really the rsync that is causing all these problems…
Thanks for these helpful hints!!

linuxdev · July 14, 2018, 11:00pm

EMI does mean “electromagnetic interference”, but the driver only knows what it sees so it can’t produce any useful information on what was causing the problem. EMI is just a suggestion the driver is mentioning because it is the most general way of describing degraded data without suggesting a specific cause. A better description might be if the driver had said “the signal has become distorted and the clean and open eye pattern is now either closed or has too much jitter” (but I’m sure most people would read that in the logs and get even less meaning from the message). Hearing “EMI” tends to make people think of an outside noise source, e.g., next to a radio transmitter, but anything modifying the differential pair of the USB will do this.

If a USB HUB or PHY gets power from an unstable power supply, then this could be considered EMI. If a motor which is unrelated to the system but nearby causes radio interference, this is EMI. If your power source also powers a motor directly connected, and causes the power supply to slightly fluctuate you wouldn’t normally call that EMI, but it is (the source of distortion changes, but the end result is the same). Bottom line is that you need to find out what is on that USB port, and then figure out what could interfere with the device…for example if there are exposed wires which are not twisted pair or if power is noisy due to servo motors. Even if you have several twisted pair wires of good quality there is still a problem if the twisting of each pair is the same pitch…each pair needs a different twist pitch so that those specific wires don’t interfere with each other (when the twist of two pairs match the noise cancelling actually cancels between those two pairs…two matched twisted pair are the equivalent of being protected from outside EMI, but not from the two pairs cross-talking).

If that USB device is providing any sort of data, and if that data is being piped in such a way that it is vulnerable to some possibility of sending to stdin under failure conditions, then this could quite possibly be the cause of the mysterious files. On the other hand, you’d still need to know why the USB is dropping out. Bad shielding? Power supply insufficient? Power line noise? Who knows…speculating is needed.

If it is a power supply issue when servo motors kick in, then you might be able to add extra capacitors to the point of delivery to the Jetson. A high quality lower capacitance tantalum (e.g., 0.1uF), perhaps a second somewhat higher capacitance tantalum (e.g., 1uF)…then a very large electrolytic (e.g., 10,000 uF), all placed as close as possible to the Jetson. Then making sure any power cable is itself a twisted pair and of a thicker gauge than seemingly needed (e.g., you might think 22 awg is ok, but try 18 awg).
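
As a rough sanity check on what the large electrolytic can do (a back-of-the-envelope sketch; the 1 A load and the 1 V tolerable droop are assumed numbers for illustration, not measurements of a TX2), the time a capacitance $C$ can carry a load current $I$ while its voltage sags by $\Delta V$ is approximately:

```latex
t \approx \frac{C\,\Delta V}{I}
  = \frac{0.01\ \mathrm{F} \times 1\ \mathrm{V}}{1\ \mathrm{A}}
  = 10\ \mathrm{ms}
```

So under these assumptions 10,000 uF bridges dips on the order of milliseconds, which fits brief load steps such as a servo starting, but not a longer brownout of the mains supply.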

It sounds like the same 12V source is powering the servos. You might try (even if it is temporary as an experiment) to place the camera and its servos on a separate power supply for further power isolation.

If you are thinking about EMI for USB, then you might be interested in knowing more about the “eye diagram” a balanced pair line will use. The data lines of PCIe are also balanced pairs with an eye diagram being relevant (what you see about design challenges of PCIe PCBs will be the same as the USB data pair issues), and anything causing the eye pattern to close or jitter can be the cause of the USB disconnect regardless of whether you call it EMI or a closed eye pattern or a noisy signal. Here is a video on eye patterns if you are interested:
#141: What is an Eye Pattern on an Oscilloscope - A Tutorial (YouTube)

A video on PCIe (a bit dry at times, but the eye pattern information is good and could be considered combating EMI):
PCI Express Physical Layer (YouTube)

cenit · August 6, 2018, 2:55pm

Dear linuxdev,
sorry for the very late reply. Thanks again for your help.
Unfortunately, nothing is going wrong in the setup I prepared in the lab. It has been up and running for a long time without any problem at all, so it is not helping to diagnose the problem that appears on the other devices…

Since our last discussion, I had the opportunity to modify the power supply of each device in the field. Now each Jetson TX2 has its own power supply, whose 12V output is not shared with any other device (previously it was shared with a camera). I really hoped this rework would relieve our symptoms but, to be honest, nothing changed.

This is an example of what’s still going on (remember that all devices reboot every night at 4 am). Many files appear in the home folder, and the history itself is full of garbage:

nvidia@jetson5:~$ journalctl -k -b -1
...
Aug 05 04:00:31 jetson5 kernel: IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
Aug 05 04:00:32 jetson5 kernel: scsi 3:0:0:0: Direct-Access     HUAWEI   TF CARD Storage  2.31 PQ: 0 ANSI: 2
Aug 05 04:00:32 jetson5 kernel: sd 3:0:0:0: [sda] Attached SCSI removable disk
Aug 05 04:00:41 jetson5 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
Aug 05 04:00:41 jetson5 kernel: cdc_ether 1-2:1.0 eth1: kevent 12 may have been dropped
Aug 05 15:14:44 jetson5 kernel: sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice
Aug 05 15:14:44 jetson5 kernel: ttyS0: 1 input overrun(s)
Aug 05 15:14:45 jetson5 kernel: sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice
Aug 05 15:14:50 jetson5 kernel: sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice
Aug 05 16:22:21 jetson5 kernel: sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice
Aug 05 16:22:24 jetson5 kernel: sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice
Aug 05 16:25:54 jetson5 kernel: sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice
Aug 05 16:26:04 jetson5 kernel: sysrq: SysRq : Emergency Sync
Aug 05 16:26:04 jetson5 kernel: Emergency Sync complete
Aug 05 16:26:20 jetson5 kernel: sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice
Aug 05 16:26:22 jetson5 kernel: sysrq: SysRq : Show Regs
Aug 05 16:26:22 jetson5 kernel: 
Aug 05 16:26:22 jetson5 kernel: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.38-tegra #7
Aug 05 16:26:22 jetson5 kernel: Hardware name: quill (DT)
Aug 05 16:26:22 jetson5 kernel: task: ffffffc00126e240 ti: ffffffc00125c000 task.ti: ffffffc00125c000
Aug 05 16:26:22 jetson5 kernel: PC is at cpuidle_enter_state+0xb8/0x2dc
Aug 05 16:26:22 jetson5 kernel: LR is at cpuidle_enter_state+0xb0/0x2dc
Aug 05 16:26:22 jetson5 kernel: pc : [<ffffffc00080c7ec>] lr : [<ffffffc00080c7e4>] pstate: 60000045
Aug 05 16:26:22 jetson5 kernel: sp : ffffffc00125fea0
Aug 05 16:26:22 jetson5 kernel: x29: ffffffc00125fea0 x28: ffffffc00125c000 
Aug 05 16:26:22 jetson5 kernel: x27: ffffffc000b6df00 x26: 000028b6fdd1af70 
Aug 05 16:26:22 jetson5 kernel: x25: ffffffc0013e7ec8 x24: 0000000000000000 
Aug 05 16:26:22 jetson5 kernel: x23: 000028b6fdd1b610 x22: 0000000000000000 
Aug 05 16:26:22 jetson5 kernel: x21: 0000000000000000 x20: 0000000000000000 
Aug 05 16:26:22 jetson5 kernel: x19: ffffffc1f666b518 x18: 00000000f490f430 
Aug 05 16:26:22 jetson5 kernel: x17: ffffffc000b6ea60 x16: 000000000000000e 
Aug 05 16:26:22 jetson5 kernel: x15: 0000000000079a50 x14: 00000012ffffffed 
Aug 05 16:26:22 jetson5 kernel: x13: 0000000000001227 x12: 0000000000000400 
Aug 05 16:26:22 jetson5 kernel: x11: 0000000000079ef2 x10: 00000000000008b0 
Aug 05 16:26:22 jetson5 kernel: x9 : 0000000100a9a030 x8 : ffffffc00126eb50 
Aug 05 16:26:22 jetson5 kernel: x7 : 0000000000000000 x6 : 0000000000ad5b98 
Aug 05 16:26:22 jetson5 kernel: x5 : 00000145cbdb7392 x4 : 00ffffffffffffff 
Aug 05 16:26:22 jetson5 kernel: x3 : 000000000fffee6b x2 : 00000001f5436000 
Aug 05 16:26:22 jetson5 kernel: x1 : ffffffc1f666c580 x0 : 0000000000000000 
Aug 05 16:26:22 jetson5 kernel: 
Aug 05 16:38:32 jetson5 kernel: sysrq: SysRq : Changing Loglevel
Aug 05 16:38:32 jetson5 kernel: sysrq: Loglevel set to 5
Aug 05 16:38:35 jetson5 kernel: sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice
Aug 05 20:17:41 jetson5 kernel: sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice
Aug 05 20:24:57 jetson5 kernel: sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice
Aug 06 04:00:01 jetson5 kernel: tegradc 15210000.nvdisplay: blank - powerdown
Aug 06 04:00:01 jetson5 kernel: tegradc 15210000.nvdisplay: unblank
...

nvidia@jetson5:~$ ll
total 1532
drwxr-xr-x  25 nvidia nvidia  200704 Aug  6 14:06 ./
drwxr-xr-x   4 root   root      4096 Jan  6  2017 ../
-rw-rw-r--   1 nvidia nvidia       0 Aug  5 15:14 ?
drwxrwxr-x   2 nvidia nvidia    4096 Jul 23 10:19 App/
-rw-------   1 nvidia nvidia   20999 Aug  6 04:00 .bash_history
-rw-r--r--   1 nvidia nvidia     220 Jun 24  2016 .bash_logout
-rw-r--r--   1 nvidia nvidia    3771 Jun 24  2016 .bashrc
drwx------ 551 nvidia nvidia   20480 May 28 08:18 .cache/
drwx------   3 nvidia nvidia    4096 May  6  2016 .compiz/
drwx------  14 nvidia nvidia    4096 May  6  2016 .config/
drwxr-xr-x   2 nvidia nvidia    4096 May  6  2016 Desktop/
drwxr-xr-x   2 nvidia nvidia    4096 May  6  2016 Documents/
drwxr-xr-x   3 nvidia nvidia    4096 Jul 26 22:50 Downloads/
-rw-r--r--   1 nvidia nvidia    8980 Jun 24  2016 examples.desktop
-rw-rw-r--   1 nvidia nvidia       0 Aug  5 16:25 +??F֏?}-????q?N??]??M?I?~ķc??鷙d??W:ن?????^ߥ]?E؎kދʪ????
drwx------   2 nvidia nvidia    4096 May  7 19:35 .gconf/
drwx------   3 nvidia nvidia    4096 Aug  6 04:00 .gnupg/
drwx------   2 nvidia nvidia    4096 May 24 14:35 .gvfs/
-rw-------   1 nvidia nvidia 1104320 Aug  6 04:00 .ICEauthority
-rwxr-xr-x   1 nvidia nvidia   10219 Apr 24 18:16 jetson_clocks.sh*
drwx------   3 nvidia nvidia    4096 May  6  2016 .local/
drwxrwxr-x   2 nvidia nvidia    4096 Aug  6 16:37 log/
drwxr-xr-x   2 nvidia nvidia    4096 May  6  2016 Music/
drwxrwxr-x   2 nvidia nvidia    4096 Jun 28 15:36 .nano/
drwx------   4 nvidia nvidia    4096 May 13 18:28 .nv/
-rw-rw-r--   1 nvidia nvidia       0 Aug  5 16:26 ?[Nچ???Wǉ??FϞ?ш?^??????n??Gg??̵c??
-rw-rw-r--   1 nvidia nvidia       0 Aug  5 15:14 ?{???O??ۏ???t??Ư.??j????DL?p??#?ƕ?Ԓ?????xѝ??{?ϲ?e՞?_??~?M??do5ݾ_s???~?????E??x???mo?4a?J?GƉo?
drwxr-xr-x   2 nvidia nvidia    4096 May  6  2016 Pictures/
-rw-r--r--   1 nvidia nvidia     675 Jun 24  2016 .profile
drwxr-xr-x   2 nvidia nvidia    4096 May  6  2016 Public/
-rw-rw-r--   1 nvidia nvidia       0 Aug  5 16:22 ???q??????Z?j?7????
-rw-rw-r--   1 nvidia nvidia       0 Aug  5 16:25 rFa????_??Zá????Gw??
-rw-r--r--   1 nvidia nvidia       0 May 28 08:17 .selected_editor
drwx------   2 nvidia nvidia    4096 May 12 11:18 .ssh/
-rw-r--r--   1 nvidia nvidia       0 May  7 19:24 .sudo_as_admin_successful
-rwxr-xr-x   1 nvidia nvidia   39049 Apr 24 18:16 tegrastats*
drwxr-xr-x   2 nvidia nvidia    4096 May  6  2016 Templates/
drwxr-xr-x   2 nvidia nvidia    4096 May  6  2016 Videos/
drwxr-xr-x   2 nvidia nvidia    4096 Jul 20 15:26 .vim/
-rw-------   1 nvidia nvidia    4214 Jul 20 15:26 .viminfo
-rw-r--r--   1 nvidia nvidia    2850 Apr 24 18:16 weston.ini
-rw-------   1 nvidia nvidia     109 Aug  6 04:00 .Xauthority
-rw-rw-r--   1 nvidia nvidia       0 Aug  5 16:25 ^???X??_??o
-rw-------   1 nvidia nvidia     186 Aug  6 04:00 .xsession-errors
-rw-------   1 nvidia nvidia    1508 Aug  6 04:00 .xsession-errors.old
-rw-rw-r--   1 nvidia nvidia       0 Aug  5 16:25 ?}y

nvidia@jetson5:~$ history
...
  378  x????"??g?????q?#??
  379  ??^/??ȏA?u????:x? V?}\O???Su?W????F??6???<+???M~?9Tq?,?u??I???цZ??/?????kF?Ct镗?6?|~???}??n??^????????????~~??????z??ݿ?T??|?????~????????????w??_??????~?????????????s???????m????߷??????_z??????{??????m?ܣ+??????????????j????????u?????ۧ??????~???????????ݿ޽u?ݿ????o????ǿ??~?????????O???K???~???{?G????^???????ޯ?߾????????K??????????????????_g?????????????????~??????_o?߯o?????????????'=????????o?I??y??u????VY|???;??~?????߿????????k???o?{???????}??w?????/v??????Z?????߯{??]???????޳??Z??W?_?????<??????????N??????o????5????ۯ?????????????????????q{޿?{n???=O????=?????˷????????m???gw?????????{??o???????볯W???߽??????ǝ??ߟ_??????e??]?O|???????????k????ߺ???????????Z1l???De??/hO??{d?O`
  380  ???Ht?A??ڿ???v???????E?׿jϪ?ξk??]??&?+??k??g?v??
  381  ?????k?????U????˕?5?U?
  382  ???s???>????OuE?W???}???????????????m?????????????X?????X???{????????Ō????WL???)??%???????;????????ݙņ??
  383  ?U?>?k????Q{M???7?n??g?ͱ??7?????]??3??]?홼kϟ??U??׻?VØņ??3???ϝ\??j??ԯ͙???n??????U?
  384  ???]잔??u??ַ??7???'?p8'?/?㥫?o??????????}????????[Z????y????????????????_???;??????o]???D????h/?
  385  ?K??b?
  386  x????"??g?????q?#????v?&???U?????Rk???m?????????z??2?t??҆?kw??W??????????????W???????????>?????#??????t?b??E~?????6??????????-?#O&????????1?:?~q?????r????/??,?????V?_??>???Ƽ!
  387  ll
  388  journalctl -k -b -1
  389  history

The next step will be to rework all our applications so that each of them runs from its own folder, as you suggested.

I suspect that the USB modem is causing all the problems, but why would it do so only on the Jetson TX2? On the Raspberry Pi units we have deployed (totally identical setup and hardware, apart from the board itself) nothing like this has ever happened.

linuxdev · August 6, 2018, 5:09pm

So now we know there isn’t an EMI issue caused by camera servo over power. Whether the cause is EMI or an unescaped character in some critical I/O, it still seems that some sort of data corruption is occurring, such that data is being redirected to a set of seemingly random files. EMI could still be the issue (just not via the camera), or something in your I/O is running into a character it doesn’t like or which does something unexpected (e.g., a “>” being taken as a command line redirect instead of as data because it was not quoted).

While you are working on running in separate folders watch carefully for any pipe of data since a pipe could be interpreted as stdin/stdout and a misplaced “>” character (if preceded by a white space character) could convert from pipe to redirect. FYI, this is why many communications protocols do something like URL encoding for http…if you have a case where things mostly work, and then once in a while some stray (and rarely encountered) set of bytes breaks the stream in the wrong way (as a delimiter instead of data), then an otherwise perfect protocol fails. Be absolutely certain binary streams are always treated as binary without the possibility of text-based parsing. Always escape white space and “>” if you get a chance.
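
For what it is worth, a “>” that sits inside a variable is only data to the shell. It turns into a redirect only when the text is parsed a second time (eval, sh -c, a command string built for ssh, and so on), so those are the places to look for. A minimal sketch of the difference, assuming a POSIX shell; the payload string and the temp directory are made up for illustration and are not taken from the scripts in this thread:

```shell
# Sketch only: 'payload' stands in for bytes arriving from a pipe or a log.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

payload='hello > stray_file'

# Plain (even unquoted) expansion: ">" stays an ordinary word, no file appears.
printf '%s\n' $payload > /dev/null
[ -e stray_file ] && plain=created || plain=absent

# Re-parsing the same text as shell code turns ">" into a real redirect.
eval "printf '%s\n' $payload"
[ -e stray_file ] && reparsed=created || reparsed=absent

echo "plain expansion: $plain, after eval: $reparsed"
```

If stray files are created this way, some layer is treating data as shell code. If no such layer exists in the scripts, the random file names are more likely to come from something else.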

I couldn’t tell you why the RPi differs. There are so many possibilities where all of them are just slightly different. Even a different version of bash could cause a difference in behavior…you have a “corner case”. I suspect when you actually find out what it is that the problem will be something exceedingly simple (an old colleague gave a technical term to cases like this: “well, duh”). These are the most interesting cases a few years after being solved.

cenit · August 8, 2018, 12:41pm

So, I still have not finished the modifications to run every script and app from a different folder. In the meantime, I think I may have found some hints about the culprit.
This is what I found today in a log from one of our Jetsons (I put … where I removed parts that were of no use, just an extremely long sequence of #-triplets of digits). Timeouts are expected, and the system should be able to deal with them by retrying the transmission later.

ssh: connect to host *redacted* port *redacted*: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [sender]\#012\#000\#000\#000\#000\#000\#000\#000\#000\#000.........\#020>\#231......\#000<0\#211......\#022:<\#355\#336\#251Fq#\#201\#277\#020\#023\#260\#374
/path/to/script: line 92: 20447 Segmentation fault      (core dumped) $RSYNC ${RSYNC_TIMEOUT} ${RSYNC_FLAGS} -e ${CUSTOM_SSH} $files $user@$host:$targetfolder

As you can see, inside the “stream” output by rsync (I have never seen it do this! Is it the core dump being written to stdout?), which was luckily caught by the log, there are many redirection operators (>, <). I have not changed anything in the logging system so far, and I cannot understand how this ended up in my log rather than in some strange file in the home folder because of the redirections…

linuxdev · August 8, 2018, 8:28pm

I can’t say from the above, but it does support the idea that a pipe was broken due to some bug (perhaps the bug is an unescaped value in some recipient program and not necessarily rsync itself…something crashed, but it isn’t the same as a full stack frame, you don’t know what led up to the invalid content). One example might be a file name which has odd characters in it needing escape, or running out of disk space (or not enough RAM).

I would suggest finding out what the environment variables were in the rsync line, i.e., the expansion of:

$RSYNC ${RSYNC_TIMEOUT} ${RSYNC_FLAGS} -e ${CUSTOM_SSH} $files $user@$host:$targetfolder
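
One way to capture that expansion is to log, right before the call, each word the shell would actually hand to rsync. A sketch, assuming a POSIX shell; show_args is a hypothetical helper, and every variable value below is an invented placeholder, not the real one from the script:

```shell
# Hypothetical helper: print every argument in brackets so that empty
# strings, embedded spaces and stray special characters become visible.
show_args() {
    for arg in "$@"; do
        printf '[%s] ' "$arg"
    done
    printf '\n'
}

# Placeholder values, for illustration only.
RSYNC=rsync
RSYNC_TIMEOUT='--timeout=30'
RSYNC_FLAGS='-a -z'
CUSTOM_SSH='ssh -p 2222'
files='/tmp/result a.txt'
user=someuser; host=example.invalid; targetfolder=/incoming

# Same unquoted expansion as in the script, but logged instead of executed.
show_args $RSYNC ${RSYNC_TIMEOUT} ${RSYNC_FLAGS} -e ${CUSTOM_SSH} $files $user@$host:$targetfolder
```

With these made-up values the log would show -e receiving only ssh, and the single file name arriving as two words. Putting set -x at the top of the script gives a similar trace on stderr.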

One thing to note about the above rsync line where I’m suggesting to find out the expansion…none of those parameters are quoted (at least not according to the debug output). If there is unexpected white space or special characters, then these are quite different:

${RSYNC_FLAGS}
"${RSYNC_FLAGS}"

…also…

$files
"$files"
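
To make the difference concrete, here is a small sketch that counts how many arguments a command receives in each form. It assumes a POSIX shell; count_args and the sample values are hypothetical:

```shell
# Hypothetical helper: report how many arguments were received.
count_args() {
    echo "$#"
}

RSYNC_FLAGS='-a -z --partial'   # invented value containing spaces
files=''                        # an empty list, e.g. nothing to send yet

unquoted_flags=$(count_args ${RSYNC_FLAGS})   # split on white space
quoted_flags=$(count_args "${RSYNC_FLAGS}")   # always exactly one argument
unquoted_empty=$(count_args $files)           # the argument vanishes entirely
quoted_empty=$(count_args "$files")           # one empty argument remains

echo "flags: unquoted=$unquoted_flags quoted=$quoted_flags"
echo "empty: unquoted=$unquoted_empty quoted=$quoted_empty"
```

Unquoted words that contain *, ? or [ are additionally subject to file name expansion, so the result can even depend on what happens to be in the current directory. Neither form is automatically right: a flag list usually needs to be split, while a single file name must not be.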

Now imagine that within “$files” it isn’t unheard of to need quotes to have the list of files be a single argument (and then the app parses files), and then to require yet more escape sequences or quoting of special characters within the file name…and the quotes within the quotes needing to be escaped (imagine if a quoted list of file names has to contain a quoted individual file name…and if that quoted file name within the quoted list of file names has its own quote as part of the name). I’ve worked on shell scripted tools to compare file system differences among entire root file systems, and believe me, all of the above are bound to happen and break any scripted tool (there is sometimes a good reason to use C/C++ instead of a loosely typed script language…those circumstances are easier to guard against).
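
If the tool stays in shell, one way to sidestep the quotes-within-quotes problem is to never flatten the file list into a single string: keep one file name per argument and expand the list with "$@" (a bash array expanded as "${files[@]}" behaves the same way). A sketch, assuming a POSIX shell; send_files and the file names are hypothetical, and a printf stands in for the real rsync call:

```shell
# Hypothetical wrapper: each file name arrives as its own argument, so
# spaces, quotes and ">" inside a name are never parsed by the shell again.
send_files() {
    for f in "$@"; do
        printf 'would send: [%s]\n' "$f"
    done
    # A real implementation would pass "$@" on to rsync in the same way.
}

send_files 'plain.txt' 'with space.txt' 'odd>name.txt' 'it'\''s "quoted".txt'
```

The only quoting left is at the point where the names are first written down or read in. Everything after that passes them along untouched.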

It may be worth your time to use some scheme such as URL encoding or base64/base85 encoding within any pipe if the data might have content capable of breaking the pipe.
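
As a sketch of that idea, the snippet below pushes bytes that include a newline, “>”, “<”, quotes and non-printable values through a text stage as base64 and decodes them on the other side. It assumes the GNU coreutils base64 tool (base64 -d to decode); the payload and the temp files are made up:

```shell
workdir=$(mktemp -d)

# Made-up payload with bytes that are awkward in text-based plumbing.
printf 'line1\n> not a redirect <\n"quoted"\t\001\377' > "$workdir/raw.bin"

# Encode before the pipe; the text stage only ever sees [A-Za-z0-9+/=].
base64 < "$workdir/raw.bin" | tr -d '\n' > "$workdir/encoded.txt"

# Decode on the receiving side and compare with the original bytes.
base64 -d < "$workdir/encoded.txt" > "$workdir/decoded.bin"

if cmp -s "$workdir/raw.bin" "$workdir/decoded.bin"; then
    roundtrip=ok
else
    roundtrip=mismatch
fi
echo "round trip: $roundtrip"
```

The cost is roughly one third more bytes on the wire, which may matter over a mobile modem link.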
