Jetson Xavier NX doesn't boot after installing CUDA

Hi,

Today I was installing CUDA 11.6 on an NVIDIA Jetson Xavier NX. The idea was to use OpenCV with CUDA to run CNNs faster.
After rebooting the device I noticed the system doesn’t boot anymore. I tried to reset the device and to put it in recovery mode (press and hold the recovery button, press the reset button, then release the recovery button), but it doesn’t work. I can’t even see anything on the screen. Only one LED is on on the board.
I’m using an Aetina AN810 carrier board for the Jetson module.
I also noticed that I rebooted the device with 0 bytes of free internal storage.
So, I would like to ask what I can do to recover the device.

Thanks

Did you install CUDA via JetPack/SDK Manager?

Note that recovery mode in itself does not alter a Jetson in any way. All it does is put it into a mode capable of flashing, but if the flash software does not run, then there is absolutely no change.

If you were to flash, then you’d need to use the board support package provided by the Aetina carrier board manufacturer (unless the board is an exact electrical match in layout to the dev kit carrier board).

Can you provide a serial console boot log? Quite often boot is successful, but lack of GUI makes it appear that the unit failed to boot. Also, if it previously had wired networking set up, and if it boots but fails to have a GUI, then ssh should probably still work (unfortunately the default Wi-Fi setup tends to run only upon GUI login).

I installed CUDA through the terminal. To make a clean installation I ran the following:

apt clean; apt update; apt purge cuda; apt purge nvidia-*; apt autoremove; apt install cuda

How can I get a serial console boot log?

I don’t know what would have been removed, but I suspect this was not truly a clean install. Most likely it removed more than CUDA. There are a lot of critical nvidia-* packages. This includes the GPU driver, at least on some releases.
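One way to see in advance what a wildcard purge would take with it is apt-get's simulate mode (-s), which prints the planned actions without changing anything. Below is a minimal sketch. The helper name list_removals is made up for illustration, and it assumes the simulate output marks removals with lines starting with "Remv" or "Purg".

```shell
#!/bin/sh
# list_removals: read `apt-get -s` output on stdin and print the names of
# the packages that would be removed or purged (hypothetical helper).
list_removals() {
    awk '$1 == "Remv" || $1 == "Purg" { print $2 }'
}

# Usage on the Jetson (simulation only, nothing is changed):
#   apt-get -s purge 'nvidia-*' | list_removals
```

If the list contains far more than CUDA itself, that is a hint the purge would touch driver or board support packages too.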

Just to emphasize, a third party carrier board will require a third party board support package during flash (unless it is an electrical layout duplicate of the dev kit carrier board). So if the Aetina carrier board manufacturer provides flash software, then likely you must use that to flash, or it would fail to fully function.

For serial console, does this carrier board have a micro-OTG USB connector? It is a “micro” connector capable of using either a type-A micro-USB or type-B micro-USB connector. The dev kits come with a quality micro-B to full-sized type-A cable (it is always a type-A to type-B, but form factor differs). They look like a charger cable, and technically a charger cable would work, but about 2/3 of “charger cables” fail due to poor quality in data lines (they quite literally often have only two strands of copper in data to save cost).

Basically, if you were to go to the Linux host PC and monitor “dmesg --follow”, and then connect the cable between the Jetson micro-USB port and a host USB port, I’d expect to see one or more serial devices listed as connected. Probably the first listed device is the serial console. You’d use any serial console program to connect. My favorite is gtkterm (“sudo apt-get install gtkterm”). Then you talk to that port with settings of 115200 8N1. For gtkterm, if your user is not in group dialout, then you’d use sudo, but here is the basic command, assuming the port is “/dev/ttyACM0”:
sudo gtkterm -b 8 -t 1 -s 115200 -p /dev/ttyACM0
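To find out which port name to use and whether sudo is needed, something like the following could be run on the host PC. This is only a sketch: the helper names has_word and list_serial_ports are made up, and it assumes the console shows up as a /dev/ttyACM* or /dev/ttyUSB* device.

```shell
#!/bin/sh
# has_word WORD LIST: succeed if WORD appears in the space-separated LIST.
has_word() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# list_serial_ports: print USB serial device nodes, if any exist.
list_serial_ports() {
    for dev in /dev/ttyACM* /dev/ttyUSB*; do
        # An unmatched glob stays literal, so check that the node exists.
        [ -e "$dev" ] && echo "$dev"
    done
    return 0
}

# Usage on the host PC after plugging in the cable:
#   list_serial_ports
#   has_word dialout "$(id -nG)" || echo "not in group dialout, use sudo"
```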

Different serial console programs are picoterm, cutecom, minicom, PuTTY, and more. A serial console boot log always helps, but if you’ve removed some important NVIDIA package which removed a configuration (or had an unexpected side-effect), then you might end up having to reflash. Was there a reason for the clean install? This tends to be something more useful in the Windows world than it is in the Linux world.

So I eventually reflashed the board and I can access it now. However, the desktop looks a bit “weird”. For instance I only have 3 buttons on the left side menu. Plus, I also see a purple arrow on each desktop icon. I’m also seeing the following warning when booting up: “Test key is used” ("Test Key is used" can be removed? - #3 by kayccc). I also find it a bit weird that when I flash the image (~1.5 GB) and install CUDA (~1.5 GB), the internal disk becomes full. I can’t install anything else.
I checked the Aetina website and they don’t provide any SDK (BSP).
I would like to know if it is normal to have ~16 GB of Ubuntu+CUDA on the internal disk. If so, how can I get an SD card to work with the board? I noticed that with the flashed image version the card is not detected. I have another NVIDIA board where the SD card is recognized.

Do you have a screenshot of that? Or perhaps just a picture snapped from a smartphone?

Also, if you right click on the desktop, are you able to access a text console? If you can, does this command succeed (I’m not interested in the result other than whether or not it works without error):
sudo ls

FYI, the install software is compressed. It takes more disk space than the package size. Sort of like “Doctor Who”'s TARDIS: The inside is larger than the outside :P

Even long ago, before things grew so much, it wasn’t unusual for added content to fill a 14 GiB partition. Newer installations often take more. However, just to illustrate what the operating system itself uses, if you go to the host PC you use for flashing, change to this directory, and run the command that follows, it’ll tell you what the base operating system occupies at the moment of flash, prior to adding any optional packages (e.g., CUDA):

cd ~/nvidia/nvidia_sdk/JetPack...version.../Linux_for_Tegra/rootfs/
sudo du -h -s .

A screenshot of the “Test key is used” message can be found in the topic I mentioned above. It is the same message.
The sudo ls works OK.
I would like to know what I can do in order for the SD card to be recognized.

Thanks

I’m not sure if that test key message actually matters or not. However, can you still ping this Jetson over the wired network? Or was it set up to use only Wi-Fi upon login? If ssh works, or if serial console works, then you could in theory log in and remove some files to get space back. I don’t know if the package updates you performed matter or not, but if there is no space left on device, then this will be a problem no matter what the status of “test key” is. I don’t know if there is any way to work on an SD card until the operating system can be logged in to (I assume this is an eMMC model with an optional SD card, which is different than an SD card model dev kit without eMMC).

I have access to the desktop. My issue is that the behaviour of the board/NVIDIA module is not the same as before I installed CUDA.
For instance, when I boot, the fan starts to spin very fast (we can hear it). Before the CUDA installation and reflash, this didn’t happen. I also have another NVIDIA module with an Aetina board (identical to the one I have problems with) where the boot is fast, the fan doesn’t spin, the SD card is recognized and I can access it with TeamViewer (no monitor attached). The NVIDIA module with problems can be accessed with TeamViewer, but the screen is black.
I contacted Aetina and they told me the process to flash the board. It is done with SDK Manager. However, they say I need to install a patch. The thing is their patch is for version 4.6.1, while the versions that SDK Manager offers are 5.0.1 and 5.0.2. I can’t download version 4.6.1 because the host has Ubuntu 20.04. Can you give me advice on what to do regarding these problems? I mean, is there any way of installing JetPack 4.6.1 from a host with Ubuntu 20.04?
What can I do to see the NVIDIA desktop screen through TeamViewer?

Thank you

Yes, there might be something shared between 4.x and 5.x, but it seems unlikely a 4.6.1 patch would work with 5.x unless lucky. Can you attach the patch? If it is something like a device tree fragment, then maybe it can be examined to see if that part of the tree is identical between 4.x and 5.x. Did Aetina know this is for 5.x, and did they suggest the patch is useful there too?

So far as TeamViewer goes, this is fairly invasive software. There is a lot which can go wrong. Basically it is a new X server, but it has to use the NVIDIA driver. The driver in most releases will require building against a different ABI before it can properly “plug in” to the server. This is definitely a big issue going from 4.x to 5.x JetPack. The X server ABI differs by a lot, and the kernel itself differs. TeamViewer would have those same differences. I can’t answer what is required specifically in your case, but in the working and failing cases, you might want to post side-by-side for each the following information (if you don’t have gawk, then “sudo apt-get install gawk”):

lsmod
head -n 1 /etc/nv_tegra_release
gawk '/Module ABI versions/,/using VT/' /var/log/Xorg.*.log
dpkg -l | egrep '(nv-|-nv|nvidia|NVIDIA)'
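To make the side-by-side comparison easier, the four commands above could be captured to one file per board and then compared with diff. A rough sketch follows. The function name collect_info and the file names are made up, and a command that fails is just noted in the file instead of stopping the run.

```shell
#!/bin/sh
# collect_info OUTFILE: run the four diagnostic commands and save their
# output, with a header per section, so two boards can be compared.
collect_info() {
    out=$1
    : > "$out"    # create or empty the output file
    for cmd in \
        "lsmod" \
        "head -n 1 /etc/nv_tegra_release" \
        "gawk '/Module ABI versions/,/using VT/' /var/log/Xorg.*.log" \
        "dpkg -l | egrep '(nv-|-nv|nvidia|NVIDIA)'"
    do
        printf '=== %s ===\n' "$cmd" >> "$out"
        # A failing command is recorded rather than aborting the run.
        sh -c "$cmd" >> "$out" 2>&1 || echo "(command failed)" >> "$out"
    done
}

# Usage: run once on each board, copy both files to one machine, then:
#   collect_info working.txt      # on the working Jetson
#   collect_info failing.txt      # on the failing Jetson
#   diff working.txt failing.txt
```

The lsmod section will likely differ in ordering and use counts even between healthy boards, so the other three sections are probably the more telling ones.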

Have you had TeamViewer work for any of the JetPack 5.x cases? I don’t know how to configure this, but this would definitely need a lot of changes to work on JetPack 5.x versus 4.x.

Good news! I was able to put the board/nvidia in the same state as before. In case someone faces the same problem, here is the explanation:
When flashing the module, the JetPack image must be the same version as the patch on the carrier’s downloads page, i.e. if the version of the patch is 4.6.1 you need to flash the module with the 4.6.1 image. However, SDK Manager doesn’t offer version 4.6.1 if the host computer runs Ubuntu 20.04. So the solution for me was to install Ubuntu 18.04 in a VM. After that I was able to download JetPack 4.6.1, followed the instructions from the carrier PDF that they sent me, applied the patch, flashed the board, and now it works. Even the SD card is recognized. But I still have to test with TeamViewer.
So, in conclusion, make sure the versions are consistent when flashing an NVIDIA Jetson NX.
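One way to confirm which release actually ended up on the module is to read the first line of /etc/nv_tegra_release on the Jetson. The sketch below assumes that line has the form "# R32 (release), REVISION: 7.1, ..." and the helper name l4t_version is made up. To my knowledge JetPack 4.6.1 corresponds to L4T 32.7.1, but it is worth checking that against the release notes.

```shell
#!/bin/sh
# l4t_version: read an nv_tegra_release file on stdin and print the L4T
# version as MAJOR.REVISION, taken from the first line (hypothetical helper).
l4t_version() {
    sed -n '1s/^# R\([0-9][0-9]*\) (release), REVISION: \([0-9.][0-9.]*\).*/\1.\2/p'
}

# Usage on the Jetson:
#   l4t_version < /etc/nv_tegra_release
```

If nothing is printed, the first line did not match the assumed format and the file should be read by eye.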
I would also like to ask how I can install other packages on the SD card, as the internal disk is full. Is there any way of installing a specific package on the SD card and having it used as if it were on the same disk as the other packages? Right now I have to install a specific version of OpenCV, but I don’t have space. This OpenCV will be used by a Python program which will also use the CUDA that is in /usr/local on the embedded disk. I just want to make sure everything works correctly.

Thank you very much for your support!

I don’t know of a simple way to tell most packages to install in an alternate location. I remember with Fedora there were many relocatable packages, but even then it was limited (and I’m not sure about Ubuntu). However, if you know where packages are actually installed, then you can mount an SD card partition on that location and everything going to that location ends up on the SD card. You would have to be quite careful though if you are interfering with system files.

  • FYI, if one properly sets up “/etc/fstab” to make mounting of a partition not error out if the device is missing, then the first part of setting up a removable device safely is met. Removing the device won’t cause boot failure.
  • If the content you are replacing with a removable device is not used in boot, then that is the next step of safely using removable media (or simply never removing the media). One example location, where CUDA lives, is “/usr/local”.
  • One can copy the content already at a location to a temporary mount point of the alternate media (making sure to preserve permissions, which in turn also means making sure you use an ext4 partition and not something like NTFS or VFAT).
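The points above could be put together roughly as follows. This is a sketch only. The device name /dev/mmcblk1p1 and the temporary mount point /mnt/sd are assumptions (check the real device with lsblk), the helper name fstab_line is made up, and the SD card partition is assumed to be formatted as ext4 already.

```shell
#!/bin/sh
# fstab_line UUID MOUNTPOINT: print an /etc/fstab entry for an ext4 partition.
# The "nofail" option lets boot continue if the SD card is absent.
fstab_line() {
    printf 'UUID=%s %s ext4 defaults,nofail 0 2\n' "$1" "$2"
}

# Outline of the manual steps (run as root; device and paths are assumed):
#   mkdir -p /mnt/sd
#   mount /dev/mmcblk1p1 /mnt/sd     # temporary mount of the ext4 partition
#   cp -a /usr/local/. /mnt/sd/      # copy, preserving owners and permissions
#   umount /mnt/sd
#   fstab_line "$(blkid -s UUID -o value /dev/mmcblk1p1)" /usr/local >> /etc/fstab
#   mount /usr/local                 # SD card content now appears at /usr/local
```

Note that the original files stay on the eMMC underneath the mount point and keep using space there until they are deleted, which has to be done while the SD card is not mounted at that location.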

What do you see currently from “df -H -T -t ext4”? What do you see from “sudo du -h -s /usr/local/”?