Cannot get Apex to install on NVIDIA AMI 19.05 (Ubuntu 18.04.3 LTS)

Hi,

Thanks for any help in advance.

I want to be able to install and use the Apex package (https://github.com/NVIDIA/apex) in order to use this (https://github.com/kaushaltrivedi/fast-bert) BERT implementation.

However, I’ve been running into many problems with this on various AMIs and I’d really appreciate help with getting it installed.

My systems team have now set me up with an NVIDIA 19.05 AMI (running Ubuntu 18.04.3 LTS), but I have run into a problem where I don’t seem to be able to access cuda-10.1’s functionality.

Running nvidia-smi, I can see that the driver is there. When I check /usr/local, however, the cuda folder is not there.

I tried installing the cuda driver as per the official documentation. Several problems arose.

  • First, though the cuda folder appeared in /usr/local, I wasn't able to access it and nvcc --version gave me nothing (other than a message telling me that the cuda toolkit could be downloaded via sudo apt-get install).
  • Second, this totally messed up the NVIDIA driver and nvidia-smi no longer worked. I then had to completely relaunch the instance to return it to its original state.

I also seemed to run into trouble getting awscli installed so I can access data in an S3 bucket. Any ideas why that might be the case?

Again, thanks for any help in advance.

Darren

The 19.05 AMI:

https://docs.nvidia.com/ngc/ngc-ami-release-notes/index.html#volta-ami-19-05-0

is intended to support NGC, i.e. the delivery of containerized workloads:

https://docs.nvidia.com/ngc/index.html

That means the AMI itself has the GPU driver loaded, plus Docker (and the NVIDIA Docker runtime), but not a full CUDA toolkit install. That is why there is no cuda folder in /usr/local.
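A quick way to confirm what is and is not present on the instance is sketched below. It is only a sketch: the helper name report is made up for illustration, and it only checks whether each command is on the PATH and whether /usr/local/cuda exists.

```shell
# report is a hypothetical helper: says whether a command is on the PATH.
report() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

report nvidia-smi   # GPU driver utility, expected on the AMI
report docker       # container engine, expected on the AMI
report nvcc         # CUDA compiler, only present once a toolkit is installed

# The toolkit directory (or symlink) is only there after a toolkit install.
if [ -d /usr/local/cuda ]; then
  echo "/usr/local/cuda: present"
else
  echo "/usr/local/cuda: absent"
fi
```

On a fresh instance of this AMI you would expect the first two to be found and the last two checks to come back missing/absent.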

One option would be to get familiar with NGC and simply run the CUDA toolkit container:

https://ngc.nvidia.com/catalog/containers/nvidia:cuda

You can also get a well-curated pytorch container:

https://ngc.nvidia.com/catalog/containers/nvidia:pytorch

which might be a quicker starting point for the fast-bert you want to try out. Note that the pytorch container comes with AMP/Apex pre-installed:

https://github.com/NVIDIA/apex
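As a rough sketch of the container route: the image tag, the host directory and the function name run_pytorch_container below are all assumptions for illustration (check the catalog page for the tags actually available), and I am assuming the Docker setup on the AMI accepts --runtime=nvidia. The script only prints the command by default; you may also need to docker login to nvcr.io first.

```shell
# Assumed image tag; pick a real one from the NGC catalog page.
IMAGE="nvcr.io/nvidia/pytorch:19.05-py3"
# Hypothetical host directory holding your fast-bert checkout and data.
HOST_DIR="${HOME:-/tmp}/fast-bert"

# Dry run by default: the command is printed, not executed.
# Run with DRY_RUN set to the empty string to really start the container.
DRY_RUN=${DRY_RUN-1}

run_pytorch_container() {
  ${DRY_RUN:+echo} docker run --runtime=nvidia -it --rm \
    -v "$HOST_DIR:/workspace/fast-bert" \
    "$IMAGE" "$@"
}

# Check that Apex imports inside the container.
run_pytorch_container python -c "import apex"
```

Mounting a host directory with -v keeps your code and data outside the container, so nothing is lost when the container exits.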

That said, it should not be difficult to install the full CUDA toolkit on the AMI itself. You just want to avoid disrupting what is already there, in particular the GPU driver.

The driver in that AMI is 418.67 (refer to the first link above), which supports up to CUDA 10.1.

So the easiest approach would be to download the runfile installer for CUDA 10.1 from http://www.nvidia.com/getcuda and use that to install CUDA. Very important: select “no” or deselect the option to install the bundled driver. This will keep your existing driver intact, which is what you want.
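If you prefer a non-interactive install, something along these lines should do the same thing. Treat it as a sketch: the runfile name is an assumption (use whatever file you actually downloaded), and it is worth confirming the options with sh <runfile> --help first. The point is that --toolkit is passed and --driver is not, so the existing driver is left alone. The script only prints the install command rather than running it.

```shell
# Assumed filename; substitute the runfile you actually downloaded.
RUNFILE="cuda_10.1.168_418.67_linux.run"

# Record the current driver version first, if nvidia-smi is available,
# so you can confirm afterwards that it has not changed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=driver_version --format=csv,noheader
fi

# Toolkit only: no driver component is requested.
install_cmd="sudo sh $RUNFILE --silent --toolkit"

# Printed rather than executed; run it by hand once you are happy with it.
echo "$install_cmd"
```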

At that point you should find a /usr/local/cuda, and then be sure to follow step 7 in the CUDA Linux install guide:

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

so that when you type nvcc it finds the compiler. Thereafter you should be able to manually install pytorch, Apex, etc. and anything else you need.
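For reference, the environment setup amounts to roughly the following, assuming the toolkit went into the default /usr/local/cuda-10.1 location. The variable name CUDA_HOME is just a convenience here, and the ${VAR:+...} form avoids leaving a stray colon when the variable was previously empty.

```shell
# Assumed default install location for the 10.1 toolkit.
CUDA_HOME=/usr/local/cuda-10.1

# Put the CUDA binaries and libraries first on the search paths.
# Note: no spaces around "=" and none inside the value.
export PATH="$CUDA_HOME/bin${PATH:+:${PATH}}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Confirm the compiler is now visible.
if command -v nvcc >/dev/null 2>&1; then
  nvcc --version
else
  echo "nvcc not found on PATH"
fi
```

To make this persistent you would typically add the two export lines to ~/.bashrc (or a similar startup file) and open a new shell.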

Thanks very much for your help.

I’ve successfully installed the CUDA Toolkit (10.1) following your advice.

However, when following the post-installation steps I get the following error:

-bash: export: `:/usr/local/cuda-10.1/lib64': not a valid identifier

When I try to navigate to the cuda or the cuda-10.1 folder, located at /usr/local, I get the following:

-bash: cd: /usr/local/cuda-10.1: Permission denied

I have experienced this previously and could not work out how to uninstall the driver again without being able to access the folder itself. I amended the user profile to give it root privileges, but this doesn’t seem to have worked either.

Any advice on how I could advance beyond this step?

Best,

Darren

Possibly the export command you placed in whatever file you modified was not quite right. Try doing:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.1/lib64
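A "not a valid identifier" message from export generally means the shell was handed :/usr/local/cuda-10.1/lib64 as a separate word, most likely because a space ended up somewhere in the line (for example before the colon). A small reproduction is sketched below. DEMO_LIB_PATH is a made-up variable name so that your real settings are not touched, and the broken line runs in a subshell.

```shell
# Broken form: the space before ":" splits the value into two words, so
# export is asked to export a "variable" named ":/usr/local/cuda-10.1/lib64".
if err=$( (export DEMO_LIB_PATH=/opt/demo/lib64 :/usr/local/cuda-10.1/lib64) 2>&1 ); then
  broken_ok=yes
else
  broken_ok=no
fi
echo "broken form succeeded: $broken_ok"
echo "shell said: $err"

# Correct form: one unbroken word, quoted for safety.
export DEMO_LIB_PATH="/opt/demo/lib64:/usr/local/cuda-10.1/lib64"
echo "DEMO_LIB_PATH=$DEMO_LIB_PATH"
```

If the line in your startup file looks like the broken form, removing the space should clear the error.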

Can you navigate to the /usr/local folder?
From there, can you determine (e.g. using ls in that folder) that /usr/local/cuda and /usr/local/cuda-10.1 exist?

If so, you have some sort of Linux permissions problem. It’s not obvious to me exactly what it is. Here’s what the permission output should look like on a bare machine install (not container):

$ ls -l /usr/local
total 16
...
lrwxrwxrwx.  1 root root   21 Apr 22 22:06 cuda -> /usr/local/cuda-10.1/
drwxr-xr-x. 18 root root 4096 Apr 22 22:07 cuda-10.1
...

It’s difficult to know exactly how this came about without understanding your entire install sequence, and whether you are operating in a container or on the base machine.

I performed the install on the base machine.

Using export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.1/lib64 doesn’t seem to make a difference.

I can navigate to the /usr/local folder, where I can see ‘cuda-10.1’ and ‘cuda’ as folders, where ‘cuda’ has a symbolic link to ‘cuda-10.1’ as per the install options for the toolkit.

Running

ls -l /usr/local

Gives the following:

(base) ubuntu@ip-10-110-6-240:/usr/local$ ls -l /usr/local
total 36
drwxr-xr-x  2 root root 4096 Apr 29 15:50 bin
lrwxrwxrwx  1 root root   21 Aug 23 15:22 cuda -> /usr/local/cuda-10.1/
drwx------ 18 root root 4096 Aug 23 15:23 cuda-10.1
drwxr-xr-x  2 root root 4096 Apr 29 15:50 etc
drwxr-xr-x  2 root root 4096 Apr 29 15:50 games
drwxr-xr-x  2 root root 4096 Apr 29 15:50 include
drwxr-xr-x  3 root root 4096 Apr 29 15:50 lib
lrwxrwxrwx  1 root root    9 Apr 29 15:50 man -> share/man
drwxr-xr-x  2 root root 4096 Apr 29 15:50 sbin
drwxr-xr-x  6 root root 4096 Aug 23 15:18 share
drwxr-xr-x  2 root root 4096 Apr 29 15:50 src

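You have a permissions problem on the cuda-10.1 directory: in your listing it is drwx------ (accessible only by root), whereas on a normal install it is drwxr-xr-x, as in the listing I posted earlier. That is why cd fails for the ubuntu user. I’m not sure how that came about.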
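One way to restore access, assuming the directory really is root-owned with mode 700 as the listing suggests, would be a recursive chmod with a+rX. That grants read to everyone, and execute/search only on directories and on files that are already executable. The sketch below tries it on a throwaway directory first; the function name open_up_tree and the demo paths are made up for illustration.

```shell
# Hypothetical helper: read for everyone, execute/search only on directories
# and on files that already have an execute bit set.
open_up_tree() {
  chmod -R a+rX "$1"
}

# Throwaway tree that mimics the locked-down install (directory mode 700).
demo=$(mktemp -d)
mkdir -p "$demo/cuda-demo/bin"
touch "$demo/cuda-demo/bin/tool" "$demo/cuda-demo/version.txt"
chmod 700 "$demo/cuda-demo" "$demo/cuda-demo/bin" "$demo/cuda-demo/bin/tool"
chmod 600 "$demo/cuda-demo/version.txt"

stat -c '%A %n' "$demo/cuda-demo"   # permissions before the change
open_up_tree "$demo/cuda-demo"
stat -c '%A %n' "$demo/cuda-demo"   # permissions after the change

# On the real machine the equivalent needs root, for example:
#   sudo chmod -R a+rX /usr/local/cuda-10.1
```

After that, ls -l /usr/local should show cuda-10.1 as drwxr-xr-x, and cd and nvcc should work for the ubuntu user.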