NVIDIA Jetson TK1 keeps stalling suddenly, then crashing

Hello, I turned on my Jetson today and it suddenly started freezing up. The mouse pointer would move very slowly as I dragged it across the screen, and then the screen would go completely black… I have no idea what could be causing this.

Everything from opening a web browser such as Firefox to opening a folder causes the whole system to freeze. Not even Ctrl+F2 does anything.

OK, now something even stranger happened: the borders of the windows on my desktop disappeared, then everything went black, then the NVIDIA BETA DRIVERS splash window appeared, then everything went black again, and now I’m back at the login screen…

Someone please help me…

One could use the serial port to debug this; the serial console should keep going even as the system fails. Lots of things could be wrong.

What version of L4T is it running? The boards ship with R19.2 and the latest is R21.4. Several graphics display issues in the earlier R21.x series were fixed around R21.3.

Heat is a possibility as well. Does the fan run?

If running on an unsecured network with no firewall or router to block incoming logins, it could even be something on the outside causing this (the default ubuntu login password is standardized; the entire world knows what it is unless it has been changed).

FYI, the borders of the apps are drawn by the window manager, so their disappearing makes sense if the graphics chip is overheating or X11 simply crashed.
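To rule heat in or out from a console, a minimal sketch like the following may help. It assumes the kernel exposes the generic Linux thermal sysfs interface under /sys/class/thermal, with temperatures reported in millidegrees Celsius; which zones exist on a given L4T release is not something this sketch guarantees. THERMAL_ROOT and show_temps are names made up for illustration.

```shell
# Assumption: thermal zones live under /sys/class/thermal (generic Linux sysfs).
THERMAL_ROOT="${THERMAL_ROOT:-/sys/class/thermal}"

# Print each readable thermal zone as "<type>: <whole degrees> C".
show_temps() {
    for zone in "$THERMAL_ROOT"/thermal_zone*; do
        # Skip zones that do not exist or cannot be read.
        [ -r "$zone/temp" ] || continue
        name=$(cat "$zone/type" 2>/dev/null || echo unknown)
        milli=$(cat "$zone/temp")
        # sysfs reports millidegrees; integer division gives whole degrees.
        echo "$name: $((milli / 1000)) C"
    done
}
```

Calling it once a second while reproducing the freeze, for example with `while true ; do show_temps ; sleep 1 ; done`, should show whether temperatures climb before the stall.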

What is really needed is to have that serial port console connected and see what it says. The DB9 connector uses 115200 8N1 settings and should show details even as the system fails (the serial port does not require any kind of networking or graphics; it’s just too simple to fail except under extreme conditions). If your desktop host does not have a serial port, you can use something like a serial USB UART to do the job (in conjunction with something like minicom or gtkterm).
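As a rough sketch of the host side, the following assumes minicom is installed and that the USB UART adapter shows up as /dev/ttyUSB0. That device name is an assumption and varies by host and adapter. SERIAL_DEV and open_serial_console are hypothetical names, and the capture file name is arbitrary.

```shell
# Assumption: the USB serial adapter enumerates as /dev/ttyUSB0 on the host.
SERIAL_DEV="${SERIAL_DEV:-/dev/ttyUSB0}"

# Open the serial console at 115200 baud and save everything to a capture file.
open_serial_console() {
    if [ ! -c "$SERIAL_DEV" ]; then
        echo "no serial device at $SERIAL_DEV" >&2
        return 1
    fi
    # -D selects the device, -b the baud rate, -C the capture file.
    # minicom normally defaults to 8N1; opening the device may need sudo
    # or membership in the dialout group.
    minicom -D "$SERIAL_DEV" -b 115200 -C jetson-serial.log
}
```

Run `open_serial_console` on the host before powering on the Jetson so the boot messages are captured too. If keystrokes are not accepted, hardware flow control in the minicom settings is a common culprit.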

If you have a lot of time you could clone the Jetson and look through the log files, but the serial port is faster. After cloning, the root file system can be loopback mounted and inspected.
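A minimal sketch of the inspection step, assuming the clone is a raw image holding a plain file system such as ext4. The function names are hypothetical, and the default search pattern is only an example of kernel messages worth looking for.

```shell
# Mount a raw clone image read-only through a loop device (needs root).
mount_clone() {
    img="$1"; dir="$2"
    sudo mkdir -p "$dir" && sudo mount -o loop,ro "$img" "$dir"
}

# List log files under the mounted root that match a pattern.
# Arguments: mounted root, optional extended regex.
find_in_logs() {
    root="$1"
    pattern="${2:-Out of memory|Oops|BUG:|segfault}"
    grep -rlE "$pattern" "$root/var/log" 2>/dev/null
}
```

Usage would look like `mount_clone system.img /mnt/clone && find_in_logs /mnt/clone`, where both the image name and the mount point are placeholders.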

The fan is running, but I can’t seem to find a way to check whether temperature is the problem. Also, the router I’ve connected the Jetson to has a firewall; even so, I have changed the password as you suggested.

I noticed an error when I tried to install something:

[  300.124073] Out of memory: Kill process 890 (Xorg) score 19 or sacrifice child

Killed process 890 (Xorg) total-vm:66300kB, anon-rss:29044kB, file-rss:10612kB

INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21007 jiffies, g=5892, c=5891, q=9)
INFO: Stall ended before state dump start

Out of memory would definitely cause issues, even with the hardware working correctly. Are you running anything that might consume memory? Assume anything you run could have a memory leak, and see whether the problem goes away without it. You could also set up swap on an SD card or SATA drive to give more time before failure and more chance to poke around.
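A sketch of the swap setup, assuming the SD card or SATA drive is already mounted with a Linux file system such as ext4 and that the running kernel has swap support enabled. alloc_swapfile and enable_swapfile are hypothetical helper names.

```shell
# Create a zero-filled file to back the swap. Arguments: path, size in MiB.
alloc_swapfile() {
    path="$1"; mb="$2"
    dd if=/dev/zero of="$path" bs=1048576 count="$mb" &&
    chmod 600 "$path"   # swap contents should not be readable by other users
}

# Format the file as swap and turn it on (needs root).
enable_swapfile() {
    path="$1"
    sudo chown root:root "$path" &&
    sudo mkswap "$path" &&
    sudo swapon "$path"
}
```

For example, `alloc_swapfile /media/ubuntu/sdcard/swapfile 1024 && enable_swapfile /media/ubuntu/sdcard/swapfile` would add 1 GiB of swap; the path is a placeholder for wherever the card is mounted. `swapon -s` or `free -m` should then list the new swap space.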

If you can get your system running for a moment (especially with a remote login, whose console won’t go away in a crash), you can monitor memory with something like:

while true ; do echo "" ; /usr/bin/head -n 2 /proc/meminfo ; sleep 1 ; done
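The loop above shows total and free memory. To see which process is growing, a sketch along these lines reads the VmRSS field from each /proc/<pid>/status file and lists the largest resident sets first. PROC_ROOT and top_rss are hypothetical names; kernel threads, which have no VmRSS line, are skipped.

```shell
# Assumption: process status files live under /proc (standard on Linux).
PROC_ROOT="${PROC_ROOT:-/proc}"

# Print "<rss kB> <pid> <name>" for the N largest processes (default 5).
top_rss() {
    n="${1:-5}"
    for f in "$PROC_ROOT"/[0-9]*/status; do
        # A process may exit between listing and reading; skip it.
        [ -r "$f" ] || continue
        awk '/^Name:/ {name=$2} /^Pid:/ {pid=$2} /^VmRSS:/ {rss=$2}
             END {if (rss != "") print rss, pid, name}' "$f" 2>/dev/null
    done | sort -rn | head -n "$n"
}
```

Running `top_rss` every few seconds and watching which entry keeps climbing is one way to spot a leaking process before the OOM killer steps in.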

FYI, a serial console is still the best way to watch this as it happens. Everything else is kind of a crutch for not having serial console.