Install Cuda 10.0 on Ubuntu16.04 (for DGX-1)

Hi All,

I am trying to install CUDA-10.0 on Ubuntu 16.04 running on DGX-1 server.
I followed the instructions for “runfile installation” in https://docs.nvidia.com/cuda/archive/10.0/cuda-installation-guide-linux/index.html#runfile.

After step 4.2.6 (i.e. Reboot the system to reload the graphical interface.), I checked the CUDA version as follows:

nvcc --version

which returns:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

However, when I run:

nvidia-smi

it returns:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I went to step 4.4 (Device Node Verification.), and found that the device files /dev/nvidia* don’t exist.
I tried to create them manually, however, running:

sudo /sbin/modprobe nvidia

returns:

modprobe: ERROR: could not insert 'nvidia': Exec format error

Please help to solve the problem. Thanks!


Other details.

lspci | grep -i nvidia
06:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
07:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
85:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
86:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
89:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
8a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
uname -m && cat /etc/*release
x86_64
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2018-03-20"
DGX_SWBUILD_VERSION="3.1.6"
DGX_COMMIT_ID="1b0f58ecbf989820ce745a9e4836e1de5eea6cfd"
DGX_SERIAL_NUMBER=QTFCOU8280021
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
gcc --version
gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
uname -r
4.4.0-142-generic
cat /proc/version
Linux version 4.4.0-142-generic (buildd@lgw01-amd64-033) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019
dpkg -l | grep nvidia
ii  dgx-peer-mem-loader                             1.1-10                                        amd64        Ensure nvidia is loaded before nv_peer_mem

DGX-1 software is mostly maintained and installed via package manager systems. You can use a runfile installer, but you’ll need to be aware of the conflicts that are inherent. These conflicts are documented in the CUDA linux install guide in the section “handle conflicting install methods”.

In short, CUDA 10 toolkit is installed, but your driver install is broken. You’ll need to clean up and remove all installation history, to rectify this.