Error installing nvidia drivers on x86_64 amazon ec2 gpu cluster (T20 GPU)

divick · June 17, 2013, 11:25am

Hi,

I am trying to install the NVIDIA drivers on the Amazon GPU cluster, but the installation fails with an error. The nvidia-installer.log is attached.

The kernel version is:

3.8.0-19-generic

lshw reports the following:

lshw -C display
WARNING: you should run this program as super-user.
*-display:0 UNCLAIMED
description: VGA compatible controller
product: GD 5446
vendor: Cirrus Logic
physical id: 2
bus info: pci@0000:00:02.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: vga_controller bus_master
configuration: latency=0
resources: memory:d0000000-d1ffffff memory:d7100000-d7100fff
*-display:1 UNCLAIMED
description: 3D controller
product: GF100GL [Tesla T20 Processor]
vendor: NVIDIA Corporation
physical id: 3
bus info: pci@0000:00:03.0
version: a3
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list
configuration: latency=0
resources: memory:d2000000-d3ffffff memory:c0000000-c3ffffff memory:c4000000-c7ffffff ioport:c100(size=128) memory:d7000000-d707ffff
*-display:2 UNCLAIMED
description: 3D controller
product: GF100GL [Tesla T20 Processor]
vendor: NVIDIA Corporation
physical id: 4
bus info: pci@0000:00:04.0
version: a3
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list
configuration: latency=0
resources: memory:d4000000-d5ffffff memory:c8000000-cbffffff memory:cc000000-cfffffff ioport:c180(size=128) memory:d7080000-d70fffff

Also, according to the Amazon EC2 docs, the cg1.4x GPU cluster instance is based on Tesla M2050 GPUs, but lspci / lshw report Tesla T20 GPUs. From what I understand, the Tesla M-class GPUs are based on the T20 chip, so hopefully I have selected the right drivers.

The driver versions I have tried are NVIDIA-Linux-x86_64-319.23.run and NVIDIA-Linux-x86_64-319.17.run. Both support Tesla M-class GPUs, and both report the same problem.

Thanks & Regards,
Divick

karolherbst · June 17, 2013, 11:45am

I think that you don’t have the drm kernel modules installed on that system. The last part of the log file indicates that the nvidia module can’t find the drm_* symbols. Maybe you have to install them first, or load them with modprobe.

karolherbst · June 17, 2013, 11:52am

In addition to that, the missing “drm_gem_prime_export” symbol seems to exist only in the 3.9 kernel. I don’t know if this symbol is a hard requirement, but if installing the drm modules doesn’t work, maybe you should try an older nvidia driver or a newer kernel.

divick · June 17, 2013, 2:45pm

Hi, thanks for the reply. Hmm, I see … that means I would need to build and install the kernel modules for the kernel installed on the Amazon AMI, right? I then tried an older AMI (Ubuntu 12.04 instead of Ubuntu 13.04) and the driver installed just fine. Nevertheless, when I get hold of the Ubuntu 13.04 AMI again, I will try building and installing the drm kernel modules.

karolherbst · June 17, 2013, 3:55pm

You could look into /lib/modules and search there. A file named “modules.symbols” should list all symbols exported by the modules. You could also try to modprobe the drm module(s). Or you could look at the kernel configuration in /proc/config(.gz) and check whether the kernel is configured with drm.

Also lsmod should be worth a look.
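
The checks suggested above can be sketched as two small shell helpers. This is only an illustration, not something from the original posts: the function names find_drm_module and symbol_listed are hypothetical, and the sketch assumes the usual layout where modules live under /lib/modules/<release> and modules.symbols contains lines of the form "alias symbol:NAME MODULE".

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of any tool).

# find_drm_module DIR
# Prints the path of any drm.ko (or compressed drm.ko.*) below DIR.
# Returns 1 when no such file exists.
find_drm_module() {
    dir=$1
    found=$(find "$dir" -type f \( -name 'drm.ko' -o -name 'drm.ko.*' \) 2>/dev/null)
    if [ -n "$found" ]; then
        printf '%s\n' "$found"
        return 0
    fi
    return 1
}

# symbol_listed FILE SYMBOL
# Returns 0 when a modules.symbols-style FILE has an entry for SYMBOL.
symbol_listed() {
    grep -q -- "symbol:$2[[:space:]]" "$1" 2>/dev/null
}
```

On the affected machine this could be run as find_drm_module "/lib/modules/$(uname -r)" and symbol_listed "/lib/modules/$(uname -r)/modules.symbols" drm_open, followed by lsmod | grep drm to see whether the module is currently loaded.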

sandipt · June 18, 2013, 2:31pm

I am unable to download your attachment nvidia-installer.log. What error are you seeing?

divick · June 19, 2013, 4:11am

The errors are as logged below:

→ Unable to determine if Secure Boot is enabled: No such file or directory
ERROR: Unable to load the kernel module ‘nvidia.ko’. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries ‘Kernel module load error’ and ‘Kernel messages’ at the end of the file ‘/var/log/nvidia-installer.log’ for more information.
→ Kernel module load error: No such file or directory
→ Kernel messages:
[ 1117.323913] nvidia: Unknown symbol drm_gem_mmap (err 0)
[ 1117.323918] nvidia: Unknown symbol drm_ioctl (err 0)
[ 1117.323928] nvidia: Unknown symbol drm_gem_object_free (err 0)
[ 1117.323942] nvidia: Unknown symbol drm_read (err 0)
[ 1117.323957] nvidia: Unknown symbol drm_gem_handle_create (err 0)
[ 1117.323962] nvidia: Unknown symbol drm_prime_pages_to_sg (err 0)
[ 1117.324002] nvidia: Unknown symbol drm_pci_exit (err 0)
[ 1117.324079] nvidia: Unknown symbol drm_release (err 0)
[ 1117.324084] nvidia: Unknown symbol drm_gem_prime_export (err 0)
[ 1863.597421] mtrr: no MTRR for d0000000,100000 found
[ 3341.270419] nvidia: Unknown symbol drm_open (err 0)
[ 3341.270426] nvidia: Unknown symbol drm_fasync (err 0)
[ 3341.270436] nvidia: Unknown symbol drm_poll (err 0)
[ 3341.270449] nvidia: Unknown symbol drm_pci_init (err 0)
[ 3341.270499] nvidia: Unknown symbol drm_gem_prime_handle_to_fd (err 0)
[ 3341.270517] nvidia: Unknown symbol drm_gem_private_object_init (err 0)
[ 3341.270532] nvidia: Unknown symbol drm_gem_mmap (err 0)
[ 3341.270537] nvidia: Unknown symbol drm_ioctl (err 0)
[ 3341.270546] nvidia: Unknown symbol drm_gem_object_free (err 0)
[ 3341.270559] nvidia: Unknown symbol drm_read (err 0)
[ 3341.270575] nvidia: Unknown symbol drm_gem_handle_create (err 0)
[ 3341.270580] nvidia: Unknown symbol drm_prime_pages_to_sg (err 0)
[ 3341.270619] nvidia: Unknown symbol drm_pci_exit (err 0)
[ 3341.270636] nvidia: Unknown symbol drm_release (err 0)
[ 3341.270639] nvidia: Unknown symbol drm_gem_prime_export (err 0)
ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
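
Since the same symbols repeat across several load attempts, it can help to reduce the kernel messages to a unique list. A minimal sketch, assuming the messages keep the "nvidia: Unknown symbol NAME (err 0)" shape shown in the log above; the function name missing_symbols is hypothetical:

```shell
#!/bin/sh
# missing_symbols: read kernel log lines on stdin and print each
# symbol the nvidia module could not resolve, once, in sorted order.
# Lines that do not match the "Unknown symbol" pattern are ignored.
missing_symbols() {
    sed -n 's/.*nvidia: Unknown symbol \([A-Za-z0-9_]*\).*/\1/p' | sort -u
}
```

For example, dmesg | missing_symbols prints one symbol per line. In the log above every entry would start with drm_, which is consistent with the drm module not being loaded.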

divick · July 5, 2013, 8:05am

This issue is resolved now. I am posting my resolution here in case it is helpful to someone else.

The issue is seen with Ubuntu 13.04 with kernel 3.8.0-19-generic. The problem was that the installer was unable to find and load the drm.ko module. I could not even find it installed in /lib/modules/3.8.0-19-generic. So I installed the kernel sources from the Ubuntu repository and built the kernel and modules. Then I inserted drm.ko, tried to build the nvidia drivers again, and it succeeded.

sudo apt-get source linux-image-3.8.0-19-generic
cd linux-3.8.0
sudo cp /boot/config-3.8.0-19-generic .config
sudo make menuconfig

Select

Device drivers --->
Graphics support --->
Direct Rendering Manager (XFree86 4.1.0 and higher DRI support) --->

make -j16
sudo insmod ./drivers/gpu/drm/drm.ko
sudo sh ./NVIDIA-Linux-x86_64-319.23.run --opengl-headers

That’s all. BTW, I tried installing the modules, but somehow I still don’t see them in /lib/modules/3.8.0-19-generic/, so I am not sure whether the nvidia kernel drivers will load on reboot.

divick · July 5, 2013, 4:26pm

I have found why the modules get built but are not loaded on reboot. Apparently the kernel that gets built reports its version as 3.8.13 instead of 3.8.0-19, so the modules get placed in /lib/modules/3.8.13.2/. So you need to change the kernel version at the top of the Makefile, or by some other mechanism. I don’t know of another way to do so.
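
A mismatch like this can be caught before installing the modules by comparing the release string of the source tree with that of the running kernel. The sketch below is an assumption-laden illustration: check_release is a hypothetical name, and it only compares two strings that the caller supplies.

```shell
#!/bin/sh
# check_release BUILT RUNNING
# Compares the release string of a built kernel tree with the release
# of the running kernel. Modules are looked up under
# /lib/modules/<running release>, so the two strings must be identical.
check_release() {
    built=$1
    running=$2
    if [ "$built" = "$running" ]; then
        echo "match: modules belong in /lib/modules/$running"
        return 0
    fi
    echo "mismatch: tree builds '$built' but the running kernel is '$running'"
    return 1
}
```

From inside a configured kernel source tree, this could be called as check_release "$(make -s kernelrelease)" "$(uname -r)". The kernelrelease target prints the string that the build would use for the /lib/modules directory name.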

cancan101 · July 23, 2014, 2:59am

I am getting this error after trying modprobe nvidia after building the 331 drivers on an AWS GPU cluster machine with 14.04:

modprobe: ERROR: could not insert 'nvidia': Unknown symbol in module, or unknown parameter (see dmesg)

[ 9552.922683] nvidia: Unknown symbol drm_open (err 0)
[ 9552.922696] nvidia: Unknown symbol drm_poll (err 0)
[ 9552.922707] nvidia: Unknown symbol drm_pci_init (err 0)
[ 9552.922750] nvidia: Unknown symbol drm_gem_prime_handle_to_fd (err 0)
[ 9552.922763] nvidia: Unknown symbol drm_gem_private_object_init (err 0)
[ 9552.922775] nvidia: Unknown symbol drm_gem_mmap (err 0)
[ 9552.922779] nvidia: Unknown symbol drm_ioctl (err 0)
[ 9552.922787] nvidia: Unknown symbol drm_gem_object_free (err 0)
[ 9552.922798] nvidia: Unknown symbol drm_read (err 0)
[ 9552.922813] nvidia: Unknown symbol drm_gem_handle_create (err 0)
[ 9552.922819] nvidia: Unknown symbol drm_prime_pages_to_sg (err 0)
[ 9552.922857] nvidia: Unknown symbol drm_pci_exit (err 0)
[ 9552.922871] nvidia: Unknown symbol drm_release (err 0)
[ 9552.922874] nvidia: Unknown symbol drm_gem_prime_export (err 0)
[ 9836.615496] nvidia: Unknown symbol drm_open (err 0)
[ 9836.615509] nvidia: Unknown symbol drm_poll (err 0)
[ 9836.615520] nvidia: Unknown symbol drm_pci_init (err 0)
[ 9836.615564] nvidia: Unknown symbol drm_gem_prime_handle_to_fd (err 0)
[ 9836.615577] nvidia: Unknown symbol drm_gem_private_object_init (err 0)
[ 9836.615589] nvidia: Unknown symbol drm_gem_mmap (err 0)
[ 9836.615593] nvidia: Unknown symbol drm_ioctl (err 0)
[ 9836.615601] nvidia: Unknown symbol drm_gem_object_free (err 0)
[ 9836.615612] nvidia: Unknown symbol drm_read (err 0)
[ 9836.615626] nvidia: Unknown symbol drm_gem_handle_create (err 0)
[ 9836.615632] nvidia: Unknown symbol drm_prime_pages_to_sg (err 0)
[ 9836.615668] nvidia: Unknown symbol drm_pci_exit (err 0)
[ 9836.615682] nvidia: Unknown symbol drm_release (err 0)
[ 9836.615685] nvidia: Unknown symbol drm_gem_prime_export (err 0)

jorijnsmit · July 25, 2014, 8:47pm

Thanks for keeping us posted on your development.

I had the same problem today on Ubuntu Server 14.04. I tried your method of compiling the drm module and inserting it but to no avail. This was on kernel version 3.13 so it looks like the bug is still there.

Oh, and instead of using make -j16 as you suggest in your post, I used make drivers/gpu/drm/ so as not to compile the full kernel but just the module. However, a drm.ko file was never generated.

So I switched back to 12.04 and installing was not a problem. Works for now!

pco · September 27, 2014, 8:04pm

I had the same problem with Ubuntu 14.04.

What worked for me was a simple:

sudo apt-get install linux-image-extra-virtual

Then the NVIDIA driver installed without a hitch.

gun4w1 · November 10, 2015, 1:06am

thanks for the tutorial.

martyychang · January 18, 2016, 5:16am

I had a similar problem trying to install CUDA on an EC2 g2.2xlarge GPU instance with the Ubuntu Server 14.04 LTS (HVM), SSD Volume Type AMI (ami-d05e75b8).

Some characteristics below on the AMI, taken from the first login:

$ lsb_release -a

$ lspci

$ nvidia-smi

AWS Support gave me a quick answer on how to resolve the issue.

$ sudo apt-get update && sudo apt-get -y upgrade
    # install the package maintainer's version (of /boot/grub/menu.lst)
$ sudo apt-get install -y linux-image-extra-`uname -r`
$ sudo apt-get update
$ wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
$ sudo apt-get update
$ sudo apt-get install -y cuda

To validate the installation using the CUDA Toolkit’s deviceQuery utility:

$ export PATH=/usr/local/cuda-7.5/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH
$ cuda-install-samples-7.5.sh ~
$ cd ~/NVIDIA_CUDA-7.5_Samples/1_Utilities/deviceQuery/
$ make
$ ~/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery

/home/ubuntu/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “GRID K520”
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4096 MBytes (4294770688 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 797 MHz (0.80 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 3
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GRID K520
Result = PASS
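
When this validation is scripted, the final "Result = PASS" line is the part worth checking. A minimal sketch, assuming deviceQuery keeps printing that line as in the output above; the function name devicequery_passed is hypothetical:

```shell
#!/bin/sh
# devicequery_passed: read deviceQuery output on stdin and return 0
# only when it contains the "Result = PASS" line.
devicequery_passed() {
    grep -q 'Result = PASS'
}
```

It could be used as ~/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery | devicequery_passed && echo "GPU is usable".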

fjm282 · June 7, 2022, 9:07pm

Hello, I am having the same problem on my EC2 instance on AWS. Any recommendations?
nvidia-installer.log (12.0 KB)