This week I had a Nano delivered, set it up from an SD card, and then shifted the install over to a 2TB NVMe drive using the jetsonhacks scripts. The problem I’m running into: when I ssh into the box, the session hangs after about 30s, regardless of what I’m doing or which SSH client I use. It’s consistent, and nothing shows up in journalctl -u ssh or in syslog. I can log in with multiple SSH sessions, but they all hang around the 30s mark. Interestingly, any commands that were running seem to run to completion; it’s just the terminal session that becomes unresponsive.
If I’m logged in via the GUI and have a terminal running there, there doesn’t seem to be a problem.
If I ssh into the Nano from a terminal in its own GUI, the session hangs right after the password is entered.
Update: this only seems to be an issue over wired Ethernet. Over WiFi I don’t see the problem.
Has anyone seen this behavior before, or know how to resolve it?
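For anyone trying to reproduce this, a quick way to rule out an idle timeout on the client side is to force keepalives; these are standard OpenSSH options (the host and user below are placeholders):

# Send an application-level keepalive every 5s; give up after 3 missed replies.
# If the session still hangs with these on, an idle timeout is unlikely.
ssh -o ServerAliveInterval=5 -o ServerAliveCountMax=3 user@nano.local

# Verbose client-side logging shows exactly when traffic stops flowing:
ssh -vvv user@nano.local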
I got confused by some comments here.
I can log in with multiple SSH sessions, but they all hang around the 30s mark. Interestingly, any commands that were running seem to run to completion; it’s just the terminal session that becomes unresponsive.
What does this mean? The SSH session hangs, but you can still type into it?
If I ssh into the Nano from a terminal in its own GUI, the session hangs right after the password is entered.
What does this mean? Are you ssh’ing from the device to itself?
When I ssh into the box over Ethernet (I’ve narrowed it down to Ethernet only; WiFi doesn’t seem to have the problem), the terminal on the client stops getting any response. The cursor in my Linux terminal, or in PuTTY on Windows, just flashes as if the connection is still up, but the session is completely unresponsive to any input and no further output is received.
I tried different variations of ssh’ing into the box: from remote machines, different OSes, different clients, trying to rule out possible variables. When I said I ssh’d from the Nano, yes, one of the variations I tried was ssh’ing back into itself from a terminal, but in that case I couldn’t get past the point of entering the password: after entering it I would get the same unresponsive terminal. This only happens when ssh’ing in via Ethernet, not via WiFi.
Have you tried SSH’ing to your Jetson from another Ubuntu host?
I have several Jetsons in a remote lab but have never seen such an issue reported.
Yes, my Linux machines are all running Ubuntu. When the Nano is plugged into Ethernet I get the problem regardless of the client OS or SSH client.
I mean, could you try flashing your board with sdkmanager first?
I’m trying to find out how to do that from the command line, as all my Linux machines are headless. I’ve got sdkmanager installed and have downloaded the components, but I can’t find in the docs whether it will flash the NVMe drive if I’ve got it hooked up via a USB adapter, whether the Nano needs to be running, etc.
Hi,
The command is in this document:
And just to clarify, this is the command for JetPack 6.2. If you are using JetPack 5, there is a different document to refer to.
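For reference, the initrd-flash invocation for JetPack 6.x targeting NVMe generally looks like the sketch below, run from the Linux_for_Tegra directory on the host with the Nano in recovery mode over USB. The exact .xml config names and board target vary by release, so treat this as an outline and follow the document above:

# Hedged sketch for a JetPack 6.x Orin Nano devkit; verify file names
# against your L4T release before running.
cd Linux_for_Tegra
sudo ./tools/kernel_flash/l4t_initrd_flash.sh \
  --external-device nvme0n1p1 \
  -c tools/kernel_flash/flash_l4t_t234_nvme.xml \
  -p "-c bootloader/generic/cfg/flash_t234_qspi.xml" \
  --showlogs --network usb0 \
  jetson-orin-nano-devkit internal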
Arghhh… I picked up a new NVMe drive to rule that out of the equation, went through the steps for flashing the NVMe drive via the Linux host, logged into the newly flashed OS with everything looking good, then ssh’d in via the Ethernet interface and, snap, same problem! It works fine over WiFi, but SSH sessions hang after about 30 seconds of being logged in if they come in via the Ethernet interface. I’ve even tried plugging it into a different switch; no joy, same behavior. Really frustrating!
Has anyone else experienced this, or do I have a flaky Nano?
Could you check the status of the board using the serial console?
See the link below:
Alrighty, I’ve got serial debug working and I’m able to attach. Is there any specific log file that would be a good place to start? I’m not seeing anything obvious, but I’m not 100% sure what I’m looking for either.
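In the meantime, a reasonable way to watch everything live from the serial console while reproducing the hang (standard journalctl/dmesg, nothing Jetson-specific):

# Follow all journal output, including sshd, while the hang happens:
sudo journalctl -f

# Or stream new kernel messages as they arrive:
sudo dmesg -w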
Attaching via serial or WiFi doesn’t seem to have any issues, but connections over Ethernet still last about 30 seconds before the session hangs. Nothing out of the ordinary shows up in /var/log/syslog or /var/log/kern.log at the time the Ethernet sessions hang. I do see sshd clean up after a timeout that corresponds to the roughly 30s duration:
Jun 16 22:17:04 nano sshd[3724]: Accepted password for *REDACTED* from 10.110.0.132 port 62230 ssh2
Jun 16 22:17:04 nano sshd[3724]: debug1: monitor_child_preauth: user *REDACTED* authenticated by privileged process
Jun 16 22:17:04 nano sshd[3724]: debug1: monitor_read_log: child log fd closed
Jun 16 22:17:04 nano sshd[3724]: debug1: PAM: establishing credentials
Jun 16 22:17:04 nano sshd[3724]: pam_unix(sshd:session): session opened for user *REDACTED*(uid=1000) by (uid=0)
Jun 16 22:17:04 nano systemd-logind[526]: New session 10 of user *REDACTED*.
Jun 16 22:17:04 nano sshd[3724]: User child is on pid 3778
Jun 16 22:17:04 nano sshd[3778]: debug1: SELinux support disabled
Jun 16 22:17:04 nano sshd[3778]: debug1: PAM: establishing credentials
Jun 16 22:17:04 nano sshd[3778]: debug1: permanently_set_uid: 1000/1000
Jun 16 22:17:04 nano sshd[3778]: debug1: rekey in after 134217728 blocks
Jun 16 22:17:04 nano sshd[3778]: debug1: rekey out after 134217728 blocks
Jun 16 22:17:04 nano sshd[3778]: debug1: ssh_packet_set_postauth: called
Jun 16 22:17:04 nano sshd[3778]: debug1: active: key options: agent-forwarding port-forwarding pty user-rc x11-forwarding
Jun 16 22:17:04 nano sshd[3778]: debug1: Entering interactive session for SSH2.
Jun 16 22:17:04 nano sshd[3778]: debug1: server_init_dispatch
Jun 16 22:17:04 nano sshd[3778]: debug1: server_input_channel_open: ctype session rchan 0 win 1048576 max 16384
Jun 16 22:17:04 nano sshd[3778]: debug1: input_session_request
Jun 16 22:17:04 nano sshd[3778]: debug1: channel 0: new [server-session]
Jun 16 22:17:04 nano sshd[3778]: debug1: session_new: session 0
Jun 16 22:17:04 nano sshd[3778]: debug1: session_open: channel 0
Jun 16 22:17:04 nano sshd[3778]: debug1: session_open: session 0: link with channel 0
Jun 16 22:17:04 nano sshd[3778]: debug1: server_input_channel_open: confirm session
Jun 16 22:17:04 nano sshd[3778]: debug1: server_input_global_request: rtype no-more-sessions@openssh.com want_reply 0
Jun 16 22:17:04 nano sshd[3778]: debug1: server_input_channel_req: channel 0 request pty-req reply 1
Jun 16 22:17:04 nano sshd[3778]: debug1: session_by_channel: session 0 channel 0
Jun 16 22:17:04 nano sshd[3778]: debug1: session_input_channel_req: session 0 req pty-req
Jun 16 22:17:04 nano sshd[3778]: debug1: Allocating pty.
Jun 16 22:17:04 nano sshd[3724]: debug1: session_new: session 0
Jun 16 22:17:04 nano sshd[3724]: debug1: SELinux support disabled
Jun 16 22:17:04 nano sshd[3778]: debug1: session_pty_req: session 0 alloc /dev/pts/6
Jun 16 22:17:04 nano sshd[3778]: debug1: server_input_channel_req: channel 0 request env reply 0
Jun 16 22:17:04 nano sshd[3778]: debug1: session_by_channel: session 0 channel 0
Jun 16 22:17:04 nano sshd[3778]: debug1: session_input_channel_req: session 0 req env
Jun 16 22:17:04 nano sshd[3778]: debug1: server_input_channel_req: channel 0 request shell reply 1
Jun 16 22:17:04 nano sshd[3778]: debug1: session_by_channel: session 0 channel 0
Jun 16 22:17:04 nano sshd[3778]: debug1: session_input_channel_req: session 0 req shell
Jun 16 22:17:04 nano sshd[3778]: Starting session: shell on pts/6 for *REDACTED* from 10.110.0.132 port 62230 id 0
Jun 16 22:17:04 nano sshd[3779]: debug1: Setting controlling tty using TIOCSCTTY.
Jun 16 22:18:40 nano sshd[2759]: Read error from remote host 10.110.0.132 port 62256: Connection timed out
Jun 16 22:18:40 nano sshd[2759]: debug1: do_cleanup
Jun 16 22:18:40 nano sshd[2759]: debug1: temporarily_use_uid: 1000/1000 (e=1000/1000)
Jun 16 22:18:40 nano sshd[2759]: debug1: restore_uid: (unprivileged)
Jun 16 22:18:40 nano sshd[2707]: debug1: do_cleanup
Jun 16 22:18:40 nano sshd[2707]: debug1: PAM: cleanup
Jun 16 22:18:40 nano sshd[2707]: debug1: PAM: closing session
Jun 16 22:18:40 nano sshd[2707]: pam_unix(sshd:session): session closed for user *REDACTED*
Jun 16 22:18:40 nano sshd[2707]: debug1: PAM: deleting credentials
Jun 16 22:18:40 nano sshd[2707]: debug1: temporarily_use_uid: 1000/1000 (e=0/0)
Jun 16 22:18:40 nano sshd[2707]: debug1: restore_uid: 0/0
Jun 16 22:18:40 nano sshd[2707]: debug1: session_pty_cleanup2: session 0 release /dev/pts/2
Jun 16 22:18:40 nano sshd[2707]: debug1: audit_event: unhandled event 12
Jun 16 22:18:40 nano systemd-logind[526]: Session 5 logged out. Waiting for processes to exit.
Jun 16 22:18:40 nano systemd-logind[526]: Removed session 5.
After a few more minutes, things seem to degrade. The serial connection stops being interactive (it can receive but no longer send), regular commands start giving Input/output errors, and the serial debug output is spitting out issues with the NVMe drive. I’ve done two installs on two different NVMe drives and get the same behavior. I’d be really surprised if two NVMe drives from different manufacturers, one of which came from a previously working system, were both bad.
[ 2463.991624] EXT4-fs warning: 8009 callbacks suppressed
[ 2463.991638] EXT4-fs warning (device nvme0n1p1): dx_probe:822: inode #17301512: lblock 0: comm nvmemwarning.sh: error -5 reading directory block
[ 2463.991656] EXT4-fs warning (device nvme0n1p1): dx_probe:822: inode #17301506: lblock 0: comm nvmemwarning.sh: error -5 reading directory block
[ 2463.991833] EXT4-fs warning (device nvme0n1p1): dx_probe:822: inode #17301512: lblock 0: comm nvmemwarning.sh: error -5 reading directory block
[ 2463.991854] EXT4-fs warning (device nvme0n1p1): dx_probe:822: inode #17301506: lblock 0: comm nvmemwarning.sh: error -5 reading directory block
[ 2463.994897] EXT4-fs warning (device nvme0n1p1): dx_probe:822: inode #17301512: lblock 0: comm nvmemwarning.sh: error -5 reading directory block
[ 2463.994915] EXT4-fs warning (device nvme0n1p1): dx_probe:822: inode #17301506: lblock 0: comm nvmemwarning.sh: error -5 reading directory block
[ 2463.995351] EXT4-fs warning (device nvme0n1p1): dx_probe:822: inode #17301512: lblock 0: comm nvmemwarning.sh: error -5 reading directory block
[ 2463.995382] EXT4-fs warning (device nvme0n1p1): dx_probe:822: inode #17301506: lblock 0: comm nvmemwarning.sh: error -5 reading directory block
[ 2463.998809] EXT4-fs warning (device nvme0n1p1): dx_probe:822: inode #17301512: lblock 0: comm nvmemwarning.sh: error -5 reading directory block
[ 2463.998828] EXT4-fs warning (device nvme0n1p1): dx_probe:822: inode #17301506: lblock 0: comm nvmemwarning.sh: error -5 reading directory block
Just to clarify: so is the issue that the filesystem on the NVMe somehow got corrupted for no reason?
I mean, we need to identify what error you actually see there. “Orin hangs after 30s using Ethernet” is too simplified a symptom. For example, if the system is totally dead, then it is normal that ssh won’t work.
We need to know whether this is a general error that can be triggered even when no Ethernet is involved.
I was providing information that deviated from the norm. The filesystem errors were almost certainly caused by having to hard power-cycle the Nano when it locks up; after fsck’ing the drive on another machine those errors go away. I wasn’t sure whether they might also be useful in triaging the persistent problem, which is that SSH sessions via Ethernet hang, every time, guaranteed, around the 30s mark. This does not happen at all if Ethernet isn’t connected.
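For completeness, the repair pass on the other machine was nothing exotic; with the NVMe in a USB adapter it amounts to something like this (the device name is an example, so check lsblk first, and only run fsck on an unmounted filesystem):

# Identify the right device; /dev/sda1 here is just an example.
lsblk
# Check and repair the ext4 partition (it must not be mounted):
sudo fsck.ext4 -f -p /dev/sda1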
I think the problem may be related to the dropped RX packets. I’ve tried different MTU values, but it made no difference.
ifconfig enP8p1s0
enP8p1s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1400
inet 10.0.0.191 netmask 255.255.0.0 broadcast 10.0.255.255
inet6 fe80::7b07:6581:475d:def5 prefixlen 64 scopeid 0x20<link>
inet6 fdd1:26b2:bc2b:d17a:c7ba:500b:11b1:f3e0 prefixlen 64 scopeid 0x0<global>
inet6 fdd1:26b2:bc2b:d17a:5fdc:ca3d:44e5:1702 prefixlen 64 scopeid 0x0<global>
ether 3c:6d:66:1e:b3:27 txqueuelen 1000 (Ethernet)
RX packets 87753 bytes 18051934 (18.0 MB)
RX errors 0 dropped 9459 overruns 0 frame 0
TX packets 6335 bytes 613759 (613.7 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 20 base 0xd000
The WiFi interface also has dropped RX packets, but nowhere near the same amount, especially given the volume of traffic going over it:
wlP1p1s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.110.0.32 netmask 255.255.255.0 broadcast 10.110.0.255
inet6 fe80::c311:3fa6:d23f:8fd1 prefixlen 64 scopeid 0x20<link>
ether a8:e2:91:d9:bf:a9 txqueuelen 1000 (Ethernet)
RX packets 52724 bytes 11655799 (11.6 MB)
RX errors 0 dropped 1871 overruns 0 frame 0
TX packets 9295 bytes 2510775 (2.5 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
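To watch the drops accumulate in real time while a session hangs, the standard counters can be polled like this (the interface name is from the output above; what ethtool -S reports depends on what the r8168 driver exposes):

# Kernel interface counters, refreshed every second:
watch -n1 'ip -s link show enP8p1s0'

# Driver/NIC-level stats and current offload settings, if exposed:
sudo ethtool -S enP8p1s0
sudo ethtool -k enP8p1s0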
I turned sshd logging up to DEBUG3 and see the following, while the dropped-packet count on the Ethernet interface also increases. I’ve connected the Nano to different switches, used different Ethernet cables, and tried ssh’ing in from different machines, but I always get the same behavior:
Jun 17 12:37:27 nano sshd[19050]: Read error from remote host 10.110.0.135 port 49232: Connection timed out
Jun 17 12:37:27 nano sshd[19050]: debug1: do_cleanup
Jun 17 12:37:27 nano sshd[19050]: debug1: temporarily_use_uid: 1000/1000 (e=1000/1000)
Jun 17 12:37:27 nano sshd[19050]: debug1: restore_uid: (unprivileged)
Jun 17 12:37:27 nano sshd[19001]: debug1: do_cleanup
Jun 17 12:37:27 nano sshd[19001]: debug1: PAM: cleanup
Jun 17 12:37:27 nano sshd[19001]: debug1: PAM: closing session
Jun 17 12:37:27 nano sshd[19001]: pam_unix(sshd:session): session closed for user carterl
Jun 17 12:37:27 nano sshd[19001]: debug1: PAM: deleting credentials
Jun 17 12:37:27 nano sshd[19001]: debug1: temporarily_use_uid: 1000/1000 (e=0/0)
Jun 17 12:37:27 nano sshd[19001]: debug1: restore_uid: 0/0
Jun 17 12:37:27 nano sshd[19001]: debug1: session_pty_cleanup2: session 0 release /dev/pts/3
Jun 17 12:37:27 nano sshd[19001]: debug1: audit_event: unhandled event 12
Jun 17 12:37:27 nano systemd-logind[531]: Session 9 logged out. Waiting for processes to exit.
Jun 17 12:37:27 nano systemd-logind[531]: Removed session 9.
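Given the “Read error … Connection timed out” line, a packet capture would show whether the Nano simply stops seeing the client’s TCP traffic. A sketch, run from the serial console so the capture itself doesn’t die with the session (the client IP is taken from the log above):

# On the Nano, capture SSH traffic on the wired interface until the hang:
sudo tcpdump -ni enP8p1s0 -w /tmp/ssh-hang.pcap host 10.110.0.135 and port 22

# Afterwards, look for one-sided retransmissions near the end:
tcpdump -nr /tmp/ssh-hang.pcap | tail -50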
OK, so we have confirmed that this is still purely an SSH problem here, right?
Could you help me confirm the following:
1 - When the ssh error happens, are you still able to operate your Jetson?
2 - If the Jetson can still be operated, could you share the full dmesg from that moment?
3 - If other devices are not able to ssh to the Jetson, are they at least able to ping it?
4 - How about the Jetson ssh’ing to another device? Is that possible?
Yes, still on the ssh issue.
1 - Yes. I can access it via serial, WiFi, and Ethernet, although all Ethernet connections are subject to the same behavior after about 30s.
2 - There’s nothing in the dmesg at that time other than the notification that I’ve plugged the cable in:
[ 2071.780544] systemd[1]: systemd 249.11-0ubuntu3.16 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY -P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
[ 2071.781178] systemd[1]: Detected architecture arm64.
[ 2071.790052] systemd[1]: Using hardware watchdog 'NVIDIA Tegra186 WDT', version 0, device /dev/watchdog
[ 2071.790077] systemd[1]: Set hardware watchdog to 2min.
[ 2072.011286] systemd[1]: /lib/systemd/system/snapd.service:23: Unknown key name 'RestartMode' in section 'Service', ignoring.
[ 2072.017125] systemd[1]: /etc/systemd/system/nvs-service.service:41: Standard output type syslog is obsolete, automatically updating to journal. Please update your unit file, and consider removing the setting altogether.
[ 2072.351378] systemd-journald[246]: Received SIGTERM from PID 1 (systemd).
[ 2072.351555] systemd[1]: Stopping Journal Service...
[ 2072.353546] systemd[1]: systemd-journald.service: Deactivated successfully.
[ 2072.354295] systemd[1]: Stopped Journal Service.
[ 2072.355324] systemd[1]: resolvconf-pull-resolved.service: Deactivated successfully.
[ 2072.355995] systemd[1]: Finished resolvconf-pull-resolved.service.
[ 2072.359811] systemd[1]: Starting Journal Service...
[ 2072.384895] systemd[1]: Started Journal Service.
[35410.853899] r8168: enP8p1s0: link up
Nothing comes after it, and there are no other events leading up to it that look connected.
3 - All devices are able to ssh to the Nano; none of them can remain connected for longer than about 30s when connecting via Ethernet.
4 - Yes, the Nano can ssh to other devices without any issues, and the connection remains stable.
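Regarding the earlier MTU experiments: one extra data point that can separate an MTU/fragmentation problem from something else is pinging the Nano’s wired address with large, unfragmentable packets (Linux ping flags; 1372 bytes of payload plus 28 bytes of IP/ICMP headers fills the 1400-byte MTU shown above):

# From a client on the wired LAN; -M do forbids fragmentation.
ping -M do -s 1372 10.0.0.191
# If near-MTU payloads fail while small ones succeed, it points at an
# MTU/offload problem rather than anything in the SSH layer.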
Let me change my comment here.
3 - All devices are able to ssh to the Nano; none of them can remain connected for longer than about 30s when connecting via Ethernet.
So the Jetson is still alive and operable over UART in this “after 30s” situation?
Just to be clear: I don’t need to know the WiFi result. All the status checks should be done from the UART console itself.
What happens if you ssh from another device to the Jetson in this hang situation? Will the Jetson accept the connection?
BTW, what is your network topology here? Is this going through a switch?