Cannot boot host computer after using NVIDIA SDK Manager

Hi,

I’m a robotics engineer at a robotics OEM, and NVIDIA SDK Manager has just left my laptop unable to boot. I’d like to get support as quickly as possible.

Details:
I’m trying to load JetPack onto a TX2. To do this I downloaded NVIDIA SDK Manager onto my development laptop, a Dell XPS 15 running Ubuntu 18.04, and went through the SDK Manager steps. For step 1 I chose Jetson, Target Hardware: TX2, Target Operating System: JetPack 4.2. For step 2 the only thing I changed was where the files were to be installed: I hooked an external hard drive up to my laptop, made two folders on it, and specified those folders. Then I accepted the terms and conditions and hit Next.

After about 10 minutes I got a lot of installation errors. Only CUDA completed; the rest errored out. SDK Manager then asked whether I wanted to retry installing the failed components or stop trying. I chose to stop, which closed SDK Manager. I then tried to reopen SDK Manager, but it wouldn’t open. I thought rebooting might help, but now I cannot boot my laptop. I can boot into recovery mode and get to a root terminal, but that’s it. From the root terminal, when I run ls the only folder I see is /snap.

How can I fix my laptop? Why did SDK Manager do this to my host PC? How do I load JetPack onto the TX2 without borking my laptop?

BTW, this also just happened to another of our engineers, who tried this just before me and then went on vacation. We are very unhappy that this has happened to two of our development laptops, and if we can’t get this resolved quickly we are going to recommend that our management remove all products we sell that rely on SDK Manager.

Sorry for the trouble. I probably can’t help completely, but as a start, the most likely problem preventing boot to a GUI would be the graphics driver. Does the laptop have an NVIDIA video card? If not, then the NVIDIA driver can’t be installed, and any attempt to switch to that driver may have left the GUI failing. Possibly things never got to that stage, but that’s where I’d start.

If you hold the Shift key down while the GRUB bootloader runs, I think it may offer you some recovery options. I cannot say for sure, but there may be a workable “repair” option there (someone who has used this more might have some insight).

When you get to a point where you can attempt this again, I would just install one component at a time. Flash runs entirely to the TX2, and then you can install components (such as CUDA) separately. Those components can be left out on the host and, if desired, added to the host later. There is probably a lot that would help in debugging what happened with SDKM, but until you can actually boot, that does not matter much.

FYI, I tend to flash from the command line. This does not require the GUI and is not specific to Ubuntu. The downside is that it can’t install extra packages (like CUDA). However, if you have a reference system that already has CUDA and the other applications, then command line flash can use a clone of that system and set things up quickly and easily. In R32.3.1 and later, once you have an install, OTA update is also possible (you just need the initial install via SDKM).

Thanks for the quick response. My laptop does have an NVIDIA graphics card, but I use the nouveau driver because every time I have tried the NVIDIA driver I get a boot loop.

Can you tell me why NVIDIA SDK Manager would potentially mess up my host machine’s display driver? It’s unclear to me why it would need to change any settings or files on my host PC other than downloading files from the internet, especially since I set the download directory to an external SSD, which is now unplugged.

I’m also skeptical that it’s a display driver issue, because I have had issues with that in the past and they have always resulted in a boot loop, but now I just get terminal text and it freezes at the end.

But more important than fixing my laptop right now is getting the TX2 flashed. If needed I will just reinstall Ubuntu. When you said “I would just install one component at a time,” were you referring to using the GUI to do that? Or the terminal? Or the terminal inside the GUI? Also, does the TX2 need to be plugged into the host computer for Step 02? In my case it wasn’t; I was expecting to do that as part of Step 03, but since Step 02 uses the term install, that suggests this could be where I went wrong. Also, what did you mean by “Flash runs entirely to the TX2”?

Update:

I rebooted my laptop again to see what the last message before it froze was. The message is “Started Hold until boot process finishes up”. While I was googling this on my phone, my laptop booted and gave me a login screen. This took about 5 minutes, so it turns out NVIDIA SDK Manager did not leave my laptop unable to boot; it just increased the boot time to 5 minutes.
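“Started Hold until boot process finishes up” is only the last unit that reported starting; it does not say which step was slow. On a systemd-based host such as Ubuntu 18.04, “systemd-analyze blame” lists units by how long they took to initialize. A minimal Python sketch that runs it and prints the ten slowest units; the duration format it parses (“1min 30.123s”, “850ms”, and so on) is an assumption based on typical output, and because units can start in parallel the times overlap rather than add up:

```python
import re
import shutil
import subprocess

# Seconds per suffix. Assumption: durations look like "1min 30.123s", "850ms",
# or "2h 1min"; some systemd versions print "µs" for microseconds.
UNIT_SECONDS = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "µs": 1e-6}
TOKEN = re.compile(r"(\d+(?:\.\d+)?)(h|min|ms|us|µs|s)")


def parse_duration(tokens):
    """Convert tokens such as ["1min", "30.5s"] into a number of seconds."""
    total = 0.0
    for token in tokens:
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError("unrecognized duration token: %r" % token)
        total += float(match.group(1)) * UNIT_SECONDS[match.group(2)]
    return total


def parse_blame(text):
    """Return (seconds, unit_name) pairs from 'systemd-analyze blame', slowest first."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # Everything but the last field is the duration; the last field is the unit name.
        rows.append((parse_duration(parts[:-1]), parts[-1]))
    return sorted(rows, reverse=True)


if __name__ == "__main__":
    if shutil.which("systemd-analyze") is None:
        raise SystemExit("systemd-analyze not found on this host")
    # universal_newlines (not text=) keeps this working on Python 3.6, the Ubuntu 18.04 default.
    result = subprocess.run(["systemd-analyze", "blame"], stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    for seconds, unit in parse_blame(result.stdout)[:10]:
        print("%8.1fs  %s" % (seconds, unit))
```

“systemd-analyze critical-chain” is the companion command if you would rather follow the dependency chain that actually delayed reaching the login screen.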

Anyway, I’m now back to trying to flash this TX2. I tried the SDKM GUI once more and got a “no space left on device” error, so I cleared some space and hit try again, and now I’m getting these errors:

18:33:25 ERROR : OpenCV on Host : E
18:33:25 ERROR : OpenCV on Host : :
18:33:25 ERROR : OpenCV on Host : Sub-process /usr/bin/dpkg returned an error code (1)
18:33:25 ERROR : OpenCV on Host :
18:33:26 ERROR : OpenCV on Host : Host Deb package [{libopencv 3.3.1-2-gb3f86dcd5}] not installed
18:33:26 ERROR : OpenCV on Host : command terminated with error
18:33:26 ERROR : OpenCV on Host : install ‘OpenCV on Host’ failure, command < cd ’
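When several components fail at once, a long SDKM log is hard to scan. A minimal Python sketch that groups error lines by component, assuming the “HH:MM:SS ERROR : component : message” layout seen in the excerpt above (other SDKM versions may log differently); paste the log on stdin or pass a saved copy as the first argument:

```python
import re
import sys

# Assumption: SDK Manager error lines follow the layout in the excerpt above,
# e.g. "18:33:26 ERROR : OpenCV on Host : command terminated with error".
ERROR_LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}) ERROR : (.+?) :(?: (.*))?$")


def failed_components(log_text):
    """Map each component that logged an ERROR to its non-empty messages, in log order."""
    failures = {}
    for raw in log_text.splitlines():
        match = ERROR_LINE.match(raw.strip())
        if match is None:
            continue  # not an ERROR line in the expected layout
        component = match.group(2)
        message = (match.group(3) or "").strip()
        messages = failures.setdefault(component, [])
        if message:
            messages.append(message)
    return failures


if __name__ == "__main__":
    # Read a saved log file if one is given, otherwise read pasted text from stdin.
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
            text = handle.read()
    else:
        text = sys.stdin.read()
    for component, messages in failed_components(text).items():
        print(component)
        for message in messages:
            print("    " + message)
```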

I’d be open to trying the command line if you point me to a reference about how to do it.
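Command-line flashing uses the “flash.sh” script in the L4T “Linux_for_Tegra” directory (the one SDKM downloads; if you set it up by hand from the L4T driver package and sample root file system, the L4T documentation has you run “sudo ./apply_binaries.sh” first). With the TX2 in recovery mode and connected to the host over USB, the TX2 invocation is “sudo ./flash.sh jetson-tx2 mmcblk0p1”. This writes only the base system to the TX2; CUDA and the other packages still have to be added separately. A minimal Python sketch that sanity-checks the directory and then runs that command; the directory path you pass in, the expected “jetson-tx2.conf” file, and the “--run” switch are assumptions of this sketch, not flash.sh options:

```python
import os
import subprocess
import sys


def flash_command(board="jetson-tx2", rootdev="mmcblk0p1"):
    """Argument list for flashing a TX2, run from inside Linux_for_Tegra."""
    return ["sudo", "./flash.sh", board, rootdev]


def missing_files(l4t_dir, board="jetson-tx2"):
    """Files this sketch expects in Linux_for_Tegra (assumption) that are absent."""
    expected = ["flash.sh", board + ".conf"]
    return [name for name in expected if not os.path.isfile(os.path.join(l4t_dir, name))]


if __name__ == "__main__":
    # Hypothetical usage: python3 flash_tx2.py /path/to/Linux_for_Tegra [--run]
    if len(sys.argv) < 2:
        raise SystemExit("usage: flash_tx2.py /path/to/Linux_for_Tegra [--run]")
    l4t_dir = sys.argv[1]
    missing = missing_files(l4t_dir)
    if missing:
        raise SystemExit("not a Linux_for_Tegra directory? missing: " + ", ".join(missing))
    cmd = flash_command()
    print("cd %s && %s" % (l4t_dir, " ".join(cmd)))
    if "--run" in sys.argv[2:]:
        # The TX2 must already be in recovery mode and connected to this host over USB.
        subprocess.run(cmd, cwd=l4t_dir, check=True)
```

Without “--run” it only prints the command, so you can check the path before anything touches the TX2.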

NOTE: Extra boot time can come from an incorrect driver failing and the system falling back to another. This might apply whether you get a slow boot or no boot to the GUI. You might post the output of “cat /proc/cmdline” and “lsmod” to see whether Nouveau is really gone.

Nouveau and the NVIDIA driver are mutually exclusive. The system will fail to start graphical mode if both are present, and it sounds like this might be what is happening (if anything interrupted SDKM in the middle of adding or changing drivers, that could account for the problem).
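“lsmod” formats the contents of /proc/modules, so both checks can be read straight from /proc. A minimal Python sketch that reports whether Nouveau is blacklisted on the current kernel command line and whether Nouveau, the NVIDIA driver, or both are loaded; it only looks at the “modprobe.blacklist=” parameter and the module names “nouveau” and “nvidia”:

```python
def blacklisted(params, key, module):
    """True if any "key=a,b,..." parameter on the command line lists the module."""
    prefix = key + "="
    return any(module in p[len(prefix):].split(",") for p in params if p.startswith(prefix))


def loaded_modules(modules_text):
    """Module names from /proc/modules (the first field of each line), as lsmod shows them."""
    return {line.split()[0] for line in modules_text.splitlines() if line.strip()}


def driver_report(cmdline, modules_text):
    """Summarize the Nouveau/NVIDIA situation from /proc/cmdline and /proc/modules text."""
    params = cmdline.split()
    modules = loaded_modules(modules_text)
    nouveau_loaded = "nouveau" in modules
    nvidia_loaded = "nvidia" in modules
    return {
        "nouveau_blacklisted_on_cmdline": blacklisted(params, "modprobe.blacklist", "nouveau"),
        "nouveau_loaded": nouveau_loaded,
        "nvidia_loaded": nvidia_loaded,
        "both_loaded": nouveau_loaded and nvidia_loaded,  # the conflicting case
    }


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        cmdline = f.read()
    with open("/proc/modules") as f:
        modules_text = f.read()
    for name, value in driver_report(cmdline, modules_text).items():
        print("%-32s %s" % (name, value))
```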

I happen to use Fedora, and I install drivers by hand rather than through packages. Along the way I’ve often ended up blacklisting Nouveau through more “enthusiastic” methods, which are closer to immune to package issues. One of my discoveries (frustrations) was that Fedora had put Nouveau all the way into the initial ramdisk, so it was present even when the root file system itself no longer had Nouveau. If you start Ubuntu and get the chance to drop into the GRUB command line, you might be able to find out whether it is just a GUI issue (this command line can override a conflicting Nouveau even in the initrd).
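On the Ubuntu host in this thread, “lsinitramfs” (part of initramfs-tools) lists the files inside an initrd, which shows whether Nouveau made it into the ramdisk. A minimal Python sketch that checks the initrd for the running kernel, assuming Ubuntu’s “/boot/initrd.img-<kernel release>” naming; reading the image may need sudo, and after changing a blacklist the initrd has to be rebuilt with “sudo update-initramfs -u” before the change reaches the ramdisk:

```python
import os
import shutil
import subprocess


def nouveau_modules(listing):
    """Paths in an initramfs listing that are the nouveau kernel module (.ko, possibly compressed)."""
    return [path for path in listing.splitlines()
            if os.path.basename(path).startswith("nouveau.ko")]


if __name__ == "__main__":
    if shutil.which("lsinitramfs") is None:
        raise SystemExit("lsinitramfs not found (on Ubuntu it comes with initramfs-tools)")
    # Assumption: Ubuntu names the initrd after the running kernel release.
    image = "/boot/initrd.img-" + os.uname().release
    result = subprocess.run(["lsinitramfs", image], stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    hits = nouveau_modules(result.stdout)
    if hits:
        print("nouveau is inside %s:" % image)
        for path in hits:
            print("    " + path)
    else:
        print("no nouveau module found in %s" % image)
```

Matching on the module file name rather than on any path containing “nouveau” avoids flagging a blacklist file such as a “blacklist-nouveau.conf” that may also be in the ramdisk.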

To start with, you would not want to do this if you have only the Nouveau driver. Presumably the NVIDIA install attempted to put in the NVIDIA driver, but quite possibly something still kept Nouveau around (at least in the initrd, even if not on the root file system you expect to boot from the hard drive). These steps edit the boot entry for one boot only, so nothing sticks around whether it works or not, but it is a very good first step for gathering information. If it works, we can put it into your GRUB configuration as a default parameter.

As GRUB starts you might need to either hit the Escape key a few times or hold the Shift key down. We’re not interested in “recovery mode”; we are interested in the GRUB command line. The goal is to find the line with the kernel arguments and append some text to it.

The part of a menu entry which is of interest will look “something like” this (but your “root=UUID=” will differ):

linux /vmlinuz-4.18.0-18-generic root=UUID=1234ebcd-babc-1234-abcd-12341234abcd ro

If you highlight your normal boot entry, I think the “e” key lets you “edit” it. In the editor you can make one-time edits which won’t be saved but will work for a single boot (there will be some keyboard hints at the bottom of the GRUB screen; it’d be hard for me to do this and watch it while booting my computer and still use that computer to write in the forum thread, so I’m going by memory). You will need to append text to the “linux” line, which will then look like this (everything to the right of “ro” is the appended part):

linux /vmlinuz-4.18.0-18-generic root=UUID=1234ebcd-babc-1234-abcd-12341234abcd ro rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1

For reference, the appended text is this, space delimited:

rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1
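Appending here means exactly that: the three space-delimited parameters go after everything already on the “linux” line, and nothing else changes. A minimal Python sketch of the same string edit, useful if you ever script it; it keeps the line’s leading indentation, skips parameters that are already present so applying it twice does not duplicate them, and (as a simplification) collapses runs of spaces between words:

```python
# The three parameters from the post, in the same order.
BLACKLIST_PARAMS = ("rd.driver.blacklist=nouveau", "modprobe.blacklist=nouveau", "nvidia-drm.modeset=1")


def append_params(linux_line, params=BLACKLIST_PARAMS):
    """Append each parameter not already present to a GRUB 'linux' line."""
    indent = linux_line[: len(linux_line) - len(linux_line.lstrip())]
    words = linux_line.split()
    if not words or words[0] != "linux":
        raise ValueError("expected a GRUB 'linux' line, got %r" % linux_line)
    missing = [p for p in params if p not in words[1:]]
    return indent + " ".join(words + missing)


if __name__ == "__main__":
    example = "linux /vmlinuz-4.18.0-18-generic root=UUID=1234ebcd-babc-1234-abcd-12341234abcd ro"
    print(append_params(example))
```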

Appending this makes the “linux…” line very long; be sure not to hit the Enter key. Then Ctrl-x should boot the edited entry. If it boots to the GUI, or gets further along, then you probably found the problem (a lingering Nouveau, which is incompatible with the NVIDIA driver). The NVIDIA driver always has to go in before CUDA, and CUDA must go in before any other CUDA-based software on the host. A video driver that is already loaded normally stays in RAM and is not reloaded until reboot, or at least until switching from graphical to text mode and back.

Passing these parameters on the kernel command line will “probably” override any other configuration that tries to load Nouveau.

Rescue mode will not attempt to boot “normally”, so it is important to try this edit on whichever GRUB entry runs when you let the system boot without intervening (which is why we want the GRUB entry for normal boot, not one for rescue shells).
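If the one-time edit gets you to a GUI, the parameters can be made a default. On Ubuntu that normally means adding them to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and then running “sudo update-grub”; that variable is applied to the normal entries but not the recovery ones, which fits the point about editing the normal boot entry. A minimal Python sketch, assuming the stock uncommented, double-quoted form of that line (it refuses to guess otherwise); it prints the result by default and only writes, with a backup, when given “--write” and run as root:

```python
import re
import shutil
import sys

PARAMS = ("rd.driver.blacklist=nouveau", "modprobe.blacklist=nouveau", "nvidia-drm.modeset=1")
# Matches an uncommented, double-quoted GRUB_CMDLINE_LINUX_DEFAULT line.
DEFAULT_LINE = re.compile(r'^GRUB_CMDLINE_LINUX_DEFAULT="([^"]*)"', re.MULTILINE)


def add_default_params(grub_text, params=PARAMS):
    """Return grub_text with params added to GRUB_CMDLINE_LINUX_DEFAULT, without duplicates."""
    match = DEFAULT_LINE.search(grub_text)
    if match is None:
        raise ValueError("no double-quoted GRUB_CMDLINE_LINUX_DEFAULT line; edit by hand")
    current = match.group(1).split()
    merged = current + [p for p in params if p not in current]
    new_line = 'GRUB_CMDLINE_LINUX_DEFAULT="%s"' % " ".join(merged)
    return grub_text[: match.start()] + new_line + grub_text[match.end():]


if __name__ == "__main__":
    path = "/etc/default/grub"
    with open(path) as f:
        original = f.read()
    updated = add_default_params(original)
    if "--write" not in sys.argv[1:]:
        print(updated)  # dry run: show the result only
    else:
        shutil.copy(path, path + ".bak")  # keep a backup; writing requires root
        with open(path, "w") as f:
            f.write(updated)
        print("wrote %s; now run: sudo update-grub" % path)
```

Nothing takes effect until “sudo update-grub” regenerates the GRUB configuration, and the .bak copy makes it easy to back out.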