Unable to complete Post Installation of Jetpack 3.2 on TX1 - CUDA 9.0 installation fails

Hi

I am installing JetPack 3.2 (L4T) on an NVIDIA TX1 and it has failed many times. Flashing completes successfully, but during post installation cuda-toolkit-9-0 fails to install, along with VisionWorks.

I am using a router/switch and there are no problems in the connection, in my opinion.

After the post installation starts, host detects the correct IP of the system, and then …

I am stuck here :

Reading package lists… Done
Reading package lists… Done
Building dependency tree
Reading state information… Done
Connection to 172.20.0.241 closed.
dpkg-query: package 'cuda-toolkit-9-0' is not installed and no information is available
dpkg-query: package 'libfreeimage-dev' is not installed and no information is available
dpkg-query: package 'libopenmpi-dev' is not installed and no information is available
dpkg-query: package 'openmpi-bin' is not installed and no information is available
Use dpkg --info (= dpkg-deb --info) to examine archive files,
and dpkg --contents (= dpkg-deb --contents) to list their contents.
1

Error: CUDA cannot be installed on device. This may be caused by other apt-get command running on device when installing CUDA. Please use apt-get command in a terminal to make sure following packages are installed correctly on device before continuing: cuda-toolkit-9-0 libgomp1 libfreeimage-dev libopenmpi-dev openmpi-bin. After these packages are installed on device, press Enter key to continue.

I checked that all these packages are already installed on the host. When I continued with the installation, the same error appeared 3-4 more times for a few other add-ons, including VisionWorks. After that, the installation reported success, but when I checked the TX1 for CUDA 9.0, it was not detected.

I checked with similar posts but couldn’t find a solution.

One possibility is that an automatic apt update was triggered and another process has dpkg locked. I’m not sure what the automatic update schedule is on a system just flashed (it would be annoying if every freshly flashed system tried to auto update immediately), but when the system is booted do you see anything from (other than a gvfs/fuse warning):

sudo lsof /var/lib/dpkg/lock
# OR:
ps aux | egrep -v '(dnsmasq|grep)' | egrep --color=never '(apt|dpkg)'

You can put this in another window to monitor once per second:

sudo watch -n 1 lsof /var/lib/dpkg/lock

If this is the case, then waiting till the update finishes would allow JetPack to do what it should, but there probably should not be an automatic update. If something has started to update, then a newly flashed system would take a very long time to do its first update.

This shouldn’t be necessary, but you can run this to be sure anything updating is no longer running and then start the package install again without rebooting:

sudo killall apt apt-get dpkg
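
If an update really is running, waiting for it is gentler than killing it. Here is a minimal sketch of a wait loop, assuming the lock path from the lsof command above. The function name and the timeout default are made up for illustration, and it should be run as root so lsof can see processes owned by other users:

```shell
# Sketch: block until no process has the dpkg lock file open.
# wait_for_dpkg_lock and its default timeout are illustrative assumptions.
wait_for_dpkg_lock() {
    lock_file=${1:-/var/lib/dpkg/lock}
    timeout=${2:-600}   # seconds to wait before giving up
    waited=0
    # "lsof -t" prints only PIDs; empty output means nobody has the file open.
    # stderr is dropped so the gvfs/fuse warning does not clutter the output.
    while [ -n "$(lsof -t "$lock_file" 2>/dev/null)" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "still locked after ${timeout}s: $lock_file" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "lock is free: $lock_file"
}
```

Once the function returns successfully, press Enter in JetPack on the host so it retries the package install.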

Another possible cause is if you haven’t added the cuda repo pubkey (it should be automatic, but sometimes fails).
You may try to log into the Jetson with ssh and try this. Then restart JetPack on the host and select only the post-install steps (no need to flash, no recovery mode, no USB, just wired ethernet).

Hello,

When the system is booted, I don’t get anything unusual or a gvfs/fuse warning

First as suggested by linuxdev, I tried on my TX1 terminal:

sudo lsof /var/lib/dpkg/lock

Result : Asks for password and then displays this :

lsof: WARNING: can’t stat() fuse.gvfsd-fuse file system /run/user/1001/gvfs
Output information may be incomplete.

Then I tried this :

ps aux | egrep -v '(dnsmasq|grep)' | egrep --color=never '(apt|dpkg)'

Result : Gives this list :
http://txt.do/dpuuq

There is of course nothing else running on the Jetson or the host except the JetPack installation.
When I try to kill the apt-get process with:

sudo killall apt apt-get dpkg

Result :
apt: no process found
apt-get: no process found
dpkg: no process found

As suggested by Honey_Patouceul, I tried logging in via ssh to nvidia@tegra-ubuntu to manually add the pubkey, but it says connection refused. From the link, do you mean using this command?


sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys F60F4B3D7FA2AF80


I have not spent a lot of time with TX1, but as far as I can understand, the Jetpack 3.2 flash and upgrade process is pretty straightforward. I have checked all my system dependencies and they seem to be alright, but still unsuccessful here.

Unfortunately the lsof doesn’t indicate anything holding a lock. The processes which did show up just happened to match a text pattern, but were unrelated.

The “killall” would be from the Jetson side. If there was nothing shown as killed then this too would imply nothing was holding a lock.

The ssh failure is the “smoking gun”. Had you been able to ssh in you could have run @Honey_Patouceul’s command…this would have made sure your repositories for packages had a key for ID.

Going to the ssh failure, JetPack cannot install packages if ssh is refused. So you have to fix ssh issues.

As background, every user which receives an ssh connection (the destination machine) has a “~/.ssh/” directory. If a connection has previously been used, and if that connection was accepted as valid, then the address will appear in “~/.ssh/known_hosts” along with a fingerprint. If something changed (which might be a “man in the middle” attack…or just a system which has had keys change due to flashing), then the connection would be refused. Sometimes the end of the ssh trying to connect to the other machine will also see a change in keys and refuse, but in this case it might be the fingerprint of the Jetson. Typically you can delete entries from “known_hosts” of the Jetson and things will work again (though you might need to do the same from the host PC for the user attempting to ssh to the Jetson).

One other cause for this, even if known_hosts does not see a key change and fail, is that the permissions of the “~/.ssh/” files must be exact…if those files are set with the wrong permissions, then ssh assumes malice and refuses ssh.

To start debugging, see what the exact permissions are for both host side and Jetson side:

cd ~/.ssh
ls -l

Not everyone will have an “authorized_keys” file, and you may have various key files. For these files though, the permissions must be exact. Compare your results to this (using “nvidia” account as an example):

-rw-------. 1 nvidia nvidia    0 Jul 21  2016 authorized_keys
-rw-------. 1 nvidia nvidia 1679 Feb  2  2016 id_rsa
-rw-r--r--. 1 nvidia nvidia  395 Feb  2  2016 id_rsa.pub
-rw-r--r--. 1 nvidia nvidia 4642 Jan 20 14:00 known_hosts

Note that if you have “authorized_keys”, then it must have rw permission for the user, and all other permissions denied. The same for “known_hosts”.
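
As a rough way to automate that comparison, here is a sketch that flags any "authorized_keys" or private key file that grants access to group or others. The function name is hypothetical, and `stat -c` assumes GNU coreutils, which is what the Ubuntu image on the Jetson and a typical Ubuntu host provide:

```shell
# Sketch: report ~/.ssh files that grant any access to group or others.
# check_ssh_perms is an illustrative name, not an existing tool.
check_ssh_perms() {
    dir=${1:-$HOME/.ssh}
    bad=0
    for f in "$dir"/authorized_keys "$dir"/id_*; do
        [ -f "$f" ] || continue                   # skip unmatched globs and missing files
        case "$f" in *.pub) continue ;; esac      # public keys may be world-readable
        mode=$(stat -c '%a' "$f")                 # octal mode, e.g. 600
        case "$mode" in
            ?00) ;;                               # owner-only access: fine
            *)  echo "$f has mode $mode, expected owner-only (e.g. 600)"
                bad=1 ;;
        esac
    done
    return "$bad"
}
```

Run `check_ssh_perms` on both the host and the Jetson. Any file it reports can be fixed with `chmod 600 <file>`.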

If you delete “known_hosts”, then a first connect will prompt you as to whether you want to use the unknown host and call it a known host. You can delete individual lines of this file, but if permissions are correct, then I’d start by just renaming this file for later examination, and then try command line ssh again (assuming you are on the Jetson using the nvidia account since this is what ssh will use):

cd ~/.ssh
mv known_hosts OLD_known_hosts
# From PC try ssh nvidia@<the_address_of_the_jetson>

Should ssh now function, then this implies you can try adding packages again.

Note that if you have “authorized_keys”, then it must have rw permission for the user, and all other permissions denied. The same for “known_hosts”.

Yes, I checked on the host that both “authorized_keys” and “known_hosts” have only rw permission and nothing else.

I was always able to successfully ssh into the TX1 from the host.

I have a second TX1, and I tried an ssh login with its IP; that login also works using “ssh nvidia@IPaddress”. On the first login I get the message below, which I guess is expected.

The authenticity of host ‘172.20.0.xxx (172.20.0.xxx)’ can’t be established.
ECDSA key fingerprint is SHA256:oTgKigtyuJgE0GInUM9EkZObS6bfjvfrxxxxxxxxxxx.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added ‘172.20.0.203’ (ECDSA) to the list of known hosts.

So ssh is not the issue, I assume.

However, I discovered that my “authorized_keys” file is actually named “au?thorized_keys”, and “authorized_keys.pub” is named “au?thorized_keys.pub”, which is weird and probably a mistake from when they were first created.

I renamed them and checked ssh again, and it works. I flashed again and got stuck at the same issue.

I have to wonder about the name change…quite possibly an alternate locale/language character set was involved. There are many places where characters can get translated while preparing on the host before transfer to the Jetson. The default is UTF-8 on en_US. On the host you used for flashing, what do you get from the “locale” command? If it is something other than en_US.UTF-8, it might mean something was translated instead of copied verbatim (as with the “?” character in those file names). In and of itself this is not particularly important, but it would imply that a large number of other files are slightly “corrupted” (the software would treat them as corrupted, though in reality it would just be a different character set encoding of the same thing).
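
To see what the “?” in those names really is, here is a small sketch that lists directory entries whose names contain bytes outside printable ASCII and shows them with escapes. The function name is made up, and `ls -b` and `find -maxdepth` assume the GNU tools that ship with Ubuntu:

```shell
# Sketch: list entries in a directory whose names contain non-printable
# or non-ASCII bytes. find_odd_names is an illustrative name.
find_odd_names() {
    dir=${1:-$HOME/.ssh}
    # In the C locale, any byte outside printable ASCII fails [:print:].
    # "ls -b" prints such bytes as backslash escapes instead of "?".
    LC_ALL=C find "$dir" -maxdepth 1 -name '*[![:print:]]*' -exec ls -db {} +
}
```

No output from `find_odd_names ~/.ssh` suggests the names are plain ASCII. Escaped bytes in the output would support the character-set theory.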

I am wondering if downloads themselves are an issue, or if the tool’s selection of download is at issue. To check downloads, try this from both the host and the Jetson…see if these succeed in downloading “cuda-repo-l4t-9-0-local_9.0.252-1_arm64.deb”:

wget http://developer.download.nvidia.com/devzone/devcenter/mobile/jetpack_l4t/3.2GA/m892ki/JetPackL4T_32_b196/cuda-repo-l4t-9-0-local_9.0.252-1_arm64.deb

NOTE: This CUDA repo is for 64-bit L4T R28.2 (JetPack3.2).

If it does succeed, then on the Jetson try:

sudo dpkg -i /where/ever/it/is/cuda-repo-l4t-9-0-local_9.0.252-1_arm64.deb
# If errors exist:
sudo apt-get install -f
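
Before running dpkg -i, it may also be worth confirming that the downloaded files are intact, since a truncated download would fail in a similar way. Here is a minimal sketch that prints every .deb in a directory that dpkg-deb cannot read. The function name is hypothetical, and the directory is whatever location holds your downloads:

```shell
# Sketch: print each .deb in a directory that dpkg-deb cannot parse.
# list_unreadable_debs is an illustrative name, not part of JetPack.
list_unreadable_debs() {
    dir=${1:-.}
    for deb in "$dir"/*.deb; do
        [ -f "$deb" ] || continue     # no .deb files matched
        # "dpkg-deb --info" exits non-zero on truncated or non-.deb files.
        if ! dpkg-deb --info "$deb" >/dev/null 2>&1; then
            printf '%s\n' "$deb"
        fi
    done
}
```

An empty result means every archive could at least be parsed. Any file it lists should be downloaded again.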

Finally, the Upgrade is successful. Thank you developers, your suggestions worked. Since this would be the closing thread, I will write detailed steps.

First, I realised that my host can ssh and has both known_hosts and authorized_keys with the right permissions, but this was not the case on the Jetson.

On the Jetson, when I checked for the same, there was no known_hosts file. So I ran ssh from the Jetson into the host to create it, which was successful. I then also had to use “chmod 600 authorized_keys” to fix the permissions of authorized_keys.

After that, I manually moved all the .deb packages from the host installation directory ‘Jetpack_downloads’ to the Jetson (via USB). Then I started the post installation; it stopped at the same point (see the first post in this thread), and I installed the package manually using

sudo dpkg -i /where/ever/it/is/cuda-repo-l4t-9-0-local_9.0.252-1_arm64.deb
sudo apt-get install -f

and when it was done, continued with post installation on host, where it says “When these packages are installed on device, press Enter twice to continue”

and repeated this process (manual package installation) about 5-6 times during the post installation. Finally I checked and CUDA, Visionworks etc were successfully installed.
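
For anyone repeating this, here is a rough sketch of how the Jetson could be checked before pressing Enter on the host each time. It prints whichever of the given packages dpkg does not report as fully installed. The function name is made up for illustration, and the package list in the usage line is the one from the JetPack error message:

```shell
# Sketch: print the packages from the argument list that are not installed.
# missing_packages is an illustrative name, not part of JetPack.
missing_packages() {
    for pkg in "$@"; do
        # dpkg-query prints "install ok installed" for a fully installed package.
        status=$(dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null)
        case "$status" in
            "install ok installed") ;;
            *) printf '%s\n' "$pkg" ;;
        esac
    done
}
```

For example, `missing_packages cuda-toolkit-9-0 libgomp1 libfreeimage-dev libopenmpi-dev openmpi-bin` should print nothing once CUDA is in place.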

However, even after all the ssh, authentication and permission problems were solved, I still don’t know what the root of the problem was. Why did I have to install the packages manually? Could something be done to improve the whole JetPack upgrade process?

I have only vague ideas on possibilities. One is that some cache you don’t know about was left in an invalid state. The other is similar: either apt or dpkg may have stopped in the middle of doing something, which left it in an invalid state that a manual operation cleared. An additional related problem which might contribute to the above is that if you don’t have a mechanism answering ssh when it asks for a password, then this can cause an ssh connection to stop after starting login and prevent the first ssh login (I usually tell people to install package “ssh-askpass” on the host, but it is also valuable on the Jetson…but it is the host which might use this during an initial login from host to Jetson).

Thank you so much @linuxdev for your inputs. I hope this thread helps someone stuck on a similar issue.