Error installing nvidia drivers on x86_64 amazon ec2 gpu cluster (T20 GPU)

Hi,

I am trying to install the NVIDIA drivers on the Amazon GPU cluster, but I get an error during installation. The nvidia-installer.log is attached.

The kernel version is:

3.8.0-19-generic

lshw reports the following:

lshw -C display
WARNING: you should run this program as super-user.
*-display:0 UNCLAIMED
description: VGA compatible controller
product: GD 5446
vendor: Cirrus Logic
physical id: 2
bus info: pci@0000:00:02.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: vga_controller bus_master
configuration: latency=0
resources: memory:d0000000-d1ffffff memory:d7100000-d7100fff
*-display:1 UNCLAIMED
description: 3D controller
product: GF100GL [Tesla T20 Processor]
vendor: NVIDIA Corporation
physical id: 3
bus info: pci@0000:00:03.0
version: a3
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list
configuration: latency=0
resources: memory:d2000000-d3ffffff memory:c0000000-c3ffffff memory:c4000000-c7ffffff ioport:c100(size=128) memory:d7000000-d707ffff
*-display:2 UNCLAIMED
description: 3D controller
product: GF100GL [Tesla T20 Processor]
vendor: NVIDIA Corporation
physical id: 4
bus info: pci@0000:00:04.0
version: a3
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list
configuration: latency=0
resources: memory:d4000000-d5ffffff memory:c8000000-cbffffff memory:cc000000-cfffffff ioport:c180(size=128) memory:d7080000-d70fffff

Also, according to the Amazon EC2 docs, the GPU cluster cg1.4x is based on Tesla M2050 GPUs, but lspci / lshw report Tesla T20 GPUs. From what I understand, Tesla M-class GPUs are based on the T20 chip, so hopefully I have selected the right drivers.

The driver versions I have tried are NVIDIA-Linux-x86_64-319.23.run and NVIDIA-Linux-x86_64-319.17.run. Both support Tesla M-class GPUs, and both report the same problem.

Thanks & Regards,
Divick

I think you don’t have the drm kernel modules installed on that system. The last part of the log file indicates that the nvidia module can’t find the drm_* symbols. Maybe you have to install them first, or load them at runtime via modprobe.

In addition, the missing “drm_gem_prime_export” symbol seems to exist only in the 3.9 kernel. I don’t know if this symbol is a hard requirement, but if installing the drm modules doesn’t work, maybe you should try an older nvidia driver or a newer kernel.

Hi, thanks for the reply. Hmm, I see … that means I would need to build and install the kernel modules for the kernel installed on the Amazon AMI, wouldn’t I? I then tried an older AMI (Ubuntu 12.04 instead of Ubuntu 13.04) and the driver installed just fine. Nevertheless, when I get hold of an Ubuntu 13.04 AMI again, I will try building and installing the drm kernel modules.

You could look into /lib/modules and search there. A file named “modules.symbols” should list all symbols exported by the modules. You could also try to modprobe the drm module(s). Or you could look at the kernel configuration in /proc/config(.gz) and check whether the kernel is configured with drm.

Also lsmod should be worth a look.
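
The checks suggested above can be sketched as two small shell helpers. This is only a sketch: the function names are made up for illustration, and it assumes that modules.symbols lists exported symbols in the form "alias symbol:<name> <module>" and that the kernel config is a plain text file (on Ubuntu it is usually also available as /boot/config-$(uname -r)).

```shell
# Hypothetical helpers, not part of any NVIDIA or Ubuntu tool.

# Print how DRM is configured in a kernel config file:
# "module" (CONFIG_DRM=m), "builtin" (CONFIG_DRM=y), or "disabled".
drm_config_state() {
    config_file=$1
    if grep -q '^CONFIG_DRM=m$' "$config_file"; then
        echo module
    elif grep -q '^CONFIG_DRM=y$' "$config_file"; then
        echo builtin
    else
        echo disabled
    fi
}

# Succeed if the given symbol is listed in a modules.symbols file.
# Symbols of built-in code are not listed there, only module exports.
symbol_is_exported() {
    symbol=$1
    symbols_file=$2
    grep -q "^alias symbol:${symbol} " "$symbols_file"
}
```

For example, `drm_config_state /boot/config-$(uname -r)` followed by `symbol_is_exported drm_open /lib/modules/$(uname -r)/modules.symbols && echo found`. If the config says "module" but the symbol is not found, the drm module was probably configured but never installed.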

I am unable to download your attachment nvidia-installer.log. What error are you seeing?

The errors are as logged below:

-> Unable to determine if Secure Boot is enabled: No such file or directory
ERROR: Unable to load the kernel module ‘nvidia.ko’. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries ‘Kernel module load error’ and ‘Kernel messages’ at the end of the file ‘/var/log/nvidia-installer.log’ for more information.
-> Kernel module load error: No such file or directory
-> Kernel messages:
[ 1117.323913] nvidia: Unknown symbol drm_gem_mmap (err 0)
[ 1117.323918] nvidia: Unknown symbol drm_ioctl (err 0)
[ 1117.323928] nvidia: Unknown symbol drm_gem_object_free (err 0)
[ 1117.323942] nvidia: Unknown symbol drm_read (err 0)
[ 1117.323957] nvidia: Unknown symbol drm_gem_handle_create (err 0)
[ 1117.323962] nvidia: Unknown symbol drm_prime_pages_to_sg (err 0)
[ 1117.324002] nvidia: Unknown symbol drm_pci_exit (err 0)
[ 1117.324079] nvidia: Unknown symbol drm_release (err 0)
[ 1117.324084] nvidia: Unknown symbol drm_gem_prime_export (err 0)
[ 1863.597421] mtrr: no MTRR for d0000000,100000 found
[ 3341.270419] nvidia: Unknown symbol drm_open (err 0)
[ 3341.270426] nvidia: Unknown symbol drm_fasync (err 0)
[ 3341.270436] nvidia: Unknown symbol drm_poll (err 0)
[ 3341.270449] nvidia: Unknown symbol drm_pci_init (err 0)
[ 3341.270499] nvidia: Unknown symbol drm_gem_prime_handle_to_fd (err 0)
[ 3341.270517] nvidia: Unknown symbol drm_gem_private_object_init (err 0)
[ 3341.270532] nvidia: Unknown symbol drm_gem_mmap (err 0)
[ 3341.270537] nvidia: Unknown symbol drm_ioctl (err 0)
[ 3341.270546] nvidia: Unknown symbol drm_gem_object_free (err 0)
[ 3341.270559] nvidia: Unknown symbol drm_read (err 0)
[ 3341.270575] nvidia: Unknown symbol drm_gem_handle_create (err 0)
[ 3341.270580] nvidia: Unknown symbol drm_prime_pages_to_sg (err 0)
[ 3341.270619] nvidia: Unknown symbol drm_pci_exit (err 0)
[ 3341.270636] nvidia: Unknown symbol drm_release (err 0)
[ 3341.270639] nvidia: Unknown symbol drm_gem_prime_export (err 0)
ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
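
The kernel messages above repeat the same symbols for each load attempt, and every one of them starts with drm_. A small filter makes that easier to see. This is just a sketch; `missing_symbols` is a made-up name, and it assumes the messages have the "Unknown symbol <name>" form shown in the log.

```shell
# Hypothetical helper: read kernel messages on stdin and print each
# unresolved symbol once, sorted.
missing_symbols() {
    sed -n 's/.*Unknown symbol \([A-Za-z0-9_]*\).*/\1/p' | sort -u
}
```

It can be fed from the kernel log, for example `dmesg | missing_symbols`.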

This issue is resolved now. I am posting my resolution here in case it helps someone else.

The issue occurs on Ubuntu 13.04 with kernel 3.8.0-19-generic. The problem was that the drm.ko module could not be found and loaded; I did not even find it installed in /lib/modules/3.8.0-19-generic. So I installed the kernel sources from the Ubuntu repository and built the kernel and modules. Then I inserted drm.ko, ran the NVIDIA installer again, and it succeeded.

  1. sudo apt-get source linux-image-3.8.0-19-generic

  2. cd linux-3.8.0

  3. sudo cp /boot/config-3.8.0-19-generic .config

  4. sudo make menuconfig

     Select:

     Device drivers --->
     Graphics support --->
     Direct Rendering Manager (XFree86 4.1.0 and higher DRI support) --->

  5. make -j16

  6. sudo insmod ./drivers/gpu/drm/drm.ko

  7. sudo sh ./NVIDIA-Linux-x86_64-319.23.run --opengl-headers

That’s all. BTW, I tried installing the modules, but somehow I still don’t see them in /lib/modules/3.8.0-19-generic/, so I am not sure whether the nvidia kernel driver will load on reboot.

I have found out why the modules were built but did not get loaded on reboot. Apparently the kernel that gets built reports version 3.8.13 instead of 3.8.0-19, so the modules are placed in /lib/modules/3.8.13.2/. You therefore need to change the kernel version at the top of the Makefile, or by some other mechanism. I don’t know of another way to do it.
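
One way to catch this mismatch before rebooting is to compare the release a module was built for with the running kernel. The sketch below assumes that the first field of a module's vermagic string (as printed by `modinfo -F vermagic`) is the kernel release; the helper names are made up for illustration.

```shell
# Hypothetical helpers for comparing a module's vermagic with a kernel release.

# Print the kernel release, i.e. the first field of a vermagic string.
vermagic_release() {
    echo "$1" | awk '{print $1}'
}

# Succeed if the vermagic string names exactly the given kernel release.
vermagic_matches() {
    [ "$(vermagic_release "$1")" = "$2" ]
}
```

For example, `vermagic_matches "$(modinfo -F vermagic ./drivers/gpu/drm/drm.ko)" "$(uname -r)" || echo "built for a different kernel"`. Running `make kernelrelease` in the source tree should also print the release string the build will use, which is the directory name that ends up under /lib/modules.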

I get this error from modprobe nvidia after building the 331 drivers on an AWS GPU cluster machine running 14.04:

modprobe: ERROR: could not insert 'nvidia': Unknown symbol in module, or unknown parameter (see dmesg)
[ 9552.922683] nvidia: Unknown symbol drm_open (err 0)
[ 9552.922696] nvidia: Unknown symbol drm_poll (err 0)
[ 9552.922707] nvidia: Unknown symbol drm_pci_init (err 0)
[ 9552.922750] nvidia: Unknown symbol drm_gem_prime_handle_to_fd (err 0)
[ 9552.922763] nvidia: Unknown symbol drm_gem_private_object_init (err 0)
[ 9552.922775] nvidia: Unknown symbol drm_gem_mmap (err 0)
[ 9552.922779] nvidia: Unknown symbol drm_ioctl (err 0)
[ 9552.922787] nvidia: Unknown symbol drm_gem_object_free (err 0)
[ 9552.922798] nvidia: Unknown symbol drm_read (err 0)
[ 9552.922813] nvidia: Unknown symbol drm_gem_handle_create (err 0)
[ 9552.922819] nvidia: Unknown symbol drm_prime_pages_to_sg (err 0)
[ 9552.922857] nvidia: Unknown symbol drm_pci_exit (err 0)
[ 9552.922871] nvidia: Unknown symbol drm_release (err 0)
[ 9552.922874] nvidia: Unknown symbol drm_gem_prime_export (err 0)
[ 9836.615496] nvidia: Unknown symbol drm_open (err 0)
[ 9836.615509] nvidia: Unknown symbol drm_poll (err 0)
[ 9836.615520] nvidia: Unknown symbol drm_pci_init (err 0)
[ 9836.615564] nvidia: Unknown symbol drm_gem_prime_handle_to_fd (err 0)
[ 9836.615577] nvidia: Unknown symbol drm_gem_private_object_init (err 0)
[ 9836.615589] nvidia: Unknown symbol drm_gem_mmap (err 0)
[ 9836.615593] nvidia: Unknown symbol drm_ioctl (err 0)
[ 9836.615601] nvidia: Unknown symbol drm_gem_object_free (err 0)
[ 9836.615612] nvidia: Unknown symbol drm_read (err 0)
[ 9836.615626] nvidia: Unknown symbol drm_gem_handle_create (err 0)
[ 9836.615632] nvidia: Unknown symbol drm_prime_pages_to_sg (err 0)
[ 9836.615668] nvidia: Unknown symbol drm_pci_exit (err 0)
[ 9836.615682] nvidia: Unknown symbol drm_release (err 0)
[ 9836.615685] nvidia: Unknown symbol drm_gem_prime_export (err 0)

Thanks for keeping us posted on your development.

I had the same problem today on Ubuntu Server 14.04. I tried your method of compiling the drm module and inserting it, but to no avail. This was on kernel version 3.13, so it looks like the bug is still there.

Oh, and instead of using make -j16 as you suggest in your post, I used make drivers/gpu/drm/ so as not to compile the full kernel but just the module. However, a drm.ko file was never generated.

So I switched back to 12.04 and installing was not a problem. Works for now!

I had the same problem with Ubuntu 14.04.

What worked for me was a simple:

sudo apt-get install linux-image-extra-virtual

Then the NVIDIA driver installed without a hitch.
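
Before running the NVIDIA installer, it may be worth checking whether the running kernel's module tree contains drm.ko at all. A rough sketch, with a made-up helper name, assuming modules live under /lib/modules/<release> and may carry a compression suffix:

```shell
# Hypothetical helper: succeed if a drm.ko (possibly compressed, e.g. drm.ko.xz)
# exists somewhere under the given modules directory.
has_drm_module() {
    modules_dir=$1
    [ -n "$(find "$modules_dir" -name 'drm.ko*' 2>/dev/null | head -n 1)" ]
}

# Example use (commented out so that sourcing this file has no side effects):
# has_drm_module "/lib/modules/$(uname -r)" || echo "drm.ko not found, install the linux-image-extra package first"
```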

thanks for the tutorial.

I had a similar problem trying to install CUDA on an EC2 g2.2xlarge GPU instance with the Ubuntu Server 14.04 LTS (HVM), SSD Volume Type AMI (ami-d05e75b8).

Some characteristics below on the AMI, taken from the first login:

$ lsb_release -a
$ lspci
$ nvidia-smi

AWS Support gave me a quick answer on how to resolve the issue.

$ sudo apt-get update && sudo apt-get -y upgrade
    # if prompted, install the package maintainer's version (of /boot/grub/menu.lst)
$ sudo apt-get install -y linux-image-extra-`uname -r`
$ sudo apt-get update
$ wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
$ sudo apt-get update
$ sudo apt-get install -y cuda

To validate the installation using the CUDA Toolkit’s deviceQuery utility:

$ export PATH=/usr/local/cuda-7.5/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH
$ cuda-install-samples-7.5.sh ~
$ cd ~/NVIDIA_CUDA-7.5_Samples/1_Utilities/deviceQuery/
$ make
$ ~/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery
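
If the check needs to be scripted, the deviceQuery output can be tested for its final status line. This is a sketch that assumes the sample prints "Result = PASS" on success; `device_query_passed` is a made-up helper name.

```shell
# Hypothetical helper: read deviceQuery output on stdin and succeed
# only if it contains the "Result = PASS" status line.
device_query_passed() {
    grep -q 'Result = PASS'
}
```

For example: `~/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery | device_query_passed && echo "CUDA is working"`.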