CUDA installation hanging on an AWS Ubuntu 14.04 instance

I have been working on installing CUDA 7.5 on an AWS EC2 instance. Things seem to have changed overnight, because the same installation script I ran yesterday no longer works today. Instead, it hangs after installing the drivers. When I look in /var/log/kern.log, I see “BUG: unable to handle kernel NULL pointer dereference”.

I am installing on an Ubuntu 14.04 base AMI (ami-abc620cb) in us-west-2 on a g2.8xlarge. I run

wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.5-18_amd64.deb
dpkg -i cuda-repo-ubuntu1404_7.5-18_amd64.deb
                
apt-get update
apt-get upgrade -y
apt-get install linux-image-generic
apt-get install -y --no-install-recommends --force-yes cuda-nvrtc-7-5 cuda-cudart-7-5 cuda-drivers cuda-core-7-5 cuda-driver-dev-7-5
nvidia-smi

with the last command hanging. I have tried several variations on the ‘linux-image-generic’ line (which, as I understand it, pulls in a kernel module that is not installed by default). I have also tried replicating this with CUDA 6.5 and 7.0. In all cases, I see the same hang.

The weird thing is that this same script used to work, and all of the commands here are variations on this installation script, which has been around for a while:

https://github.com/BVLC/caffe/wiki/Caffe-on-EC2-Ubuntu-14.04-Cuda-7

So something changed, but it’s beyond me to determine what. Any help is most appreciated.

–braxton

Here’s the full trace from /var/log/kern.log:

Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.168420] BUG: unable to handle kernel NULL pointer dereference at           (null)
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] IP: [<ffffffff8172b9cb>] __down_common+0x4c/0x144
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] PGD 79122f067 PUD 78bdcd067 PMD 0 
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] Oops: 0002 [#1] SMP 
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] Modules linked in: nvidia(POX+) drm btrfs raid6_pq xor ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs xt_nat xt_tcpudp xt_addrtype xt_conntrack ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter ip_tables x_tables nf_nat nf_conntrack bridge stp llc dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison libcrc32c dm_crypt syscopyarea sysfillrect crct10dif_pclmul crc32_pclmul serio_raw sysimgblt fb_sys_fops isofs aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd psmouse floppy
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] CPU: 14 PID: 55365 Comm: nvidia-persiste Tainted: P           OX 3.13.0-77-generic #121-Ubuntu
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/07/2015
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] task: ffff8800bb838000 ti: ffff880790646000 task.ti: ffff880790646000
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] RIP: 0010:[<ffffffff8172b9cb>]  [<ffffffff8172b9cb>] __down_common+0x4c/0x144
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] RSP: 0018:ffff880790647b68  EFLAGS: 00010092
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] RAX: 0000000000000000 RBX: ffffffffa0b504c0 RCX: 0000000000000000
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] RDX: ffffffffa0b504c8 RSI: ffff880790647b70 RDI: ffffffffa0b504c0
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] RBP: ffff880790647bb8 R08: 0000000000000296 R09: ffffffffa087ecfb
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] R10: 0000000000000008 R11: 00000000000000ff R12: 7fffffffffffffff
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] R13: ffff8800bb838000 R14: 0000000000000002 R15: 0000000000000000
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] FS:  00007f513a095740(0000) GS:ffff88079edc0000(0000) knlGS:0000000000000000
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] CR2: 0000000000000000 CR3: 0000000790303000 CR4: 00000000000406e0
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] Stack:
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  ffff880790647cd0 ffffffffa0b504c8 0000000000000000 ffff8807918eb600
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  0000000000000000 ffffffffa0b504c0 ffff8807905e8000 0000000000000003
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  ffff8807900d92b8 0000000000000002 ffff880790647bc8 ffffffff8172bae0
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] Call Trace:
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffff8172bae0>] __down+0x1d/0x1f
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffff810b10c1>] down+0x41/0x50
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffffa087f037>] nvidia_open+0x3c7/0x9b0 [nvidia]
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffffa087ddd9>] nvidia_frontend_open+0x49/0xa0 [nvidia]
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffff811c2f3f>] chrdev_open+0x9f/0x1d0
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffff811bba73>] do_dentry_open+0x233/0x2e0
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffff811c2ea0>] ? cdev_put+0x30/0x30
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffff811bbda9>] vfs_open+0x49/0x50
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffff811cace1>] do_last+0x541/0x1200
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffff8131677b>] ? apparmor_file_alloc_security+0x5b/0x180
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffff811cdf1b>] path_openat+0xbb/0x640
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffff810c960d>] ? call_rcu_sched+0x1d/0x20
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffff811cf30a>] do_filp_open+0x3a/0x90
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffff811dc167>] ? __alloc_fd+0xa7/0x130
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffff811bd8c9>] do_sys_open+0x129/0x280
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffff811bda3e>] SyS_open+0x1e/0x20
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  [<ffffffff81735d1d>] system_call_fastpath+0x1a/0x1f
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] Code: 54 49 89 d4 48 8d 57 08 53 48 89 fb 48 83 e4 f0 48 83 ec 28 48 8b 47 10 48 8d 74 24 08 48 89 54 24 08 48 89 44 24 10 48 89 77 10 <48> 89 30 4c 89 f0 4c 89 6c 24 18 83 e0 01 c6 44 24 20 00 48 89 
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] RIP  [<ffffffff8172b9cb>] __down_common+0x4c/0x144
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389]  RSP <ffff880790647b68>
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] CR2: 0000000000000000
Feb 19 18:04:18 ip-10-214-11-24 kernel: [ 1010.172389] ---[ end trace 022a1ea98d066145 ]---
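
For anyone wanting to pull a trace like this out of their own log, here is a minimal sketch. The helper name extract_oops is made up for illustration, and it assumes the oops starts with the "BUG: unable to handle kernel" line and ends with the "end trace" marker, as it does above.

```shell
#!/bin/sh
# extract_oops FILE (hypothetical helper): print every kernel oops in FILE,
# from the "BUG:" line through the next "end trace" marker.
extract_oops() {
    awk '/BUG: unable to handle kernel/,/end trace/' "$1"
}

# Assumption: the log lives at the Ubuntu default path used in this thread.
LOG="/var/log/kern.log"
if [ -r "$LOG" ]; then
    extract_oops "$LOG"
fi
```

The awk range pattern prints nothing when the log contains no oops, so an empty result suggests the hang has a different cause.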

@braxton

This issue should be fixed in the newer driver versions from the R352 driver family.
Could you please try again with the latest driver version (say, v352.79)?
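
To see which driver version is actually installed without running nvidia-smi (which hangs here), something like the following may help. It is only a sketch: it assumes the kernel module is registered under the name "nvidia" (as in the "Modules linked in" line of the trace), that GNU sort with -V is available, and version_ge is a helper name made up for this example.

```shell
#!/bin/sh
# Minimum driver version suggested in the reply above.
MIN_VERSION="352.79"

# version_ge A B (hypothetical helper): succeeds when version A >= version B.
# Relies on GNU sort -V for version-aware ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Read the version from the module file on disk; this does not open the
# GPU device. Assumption: the module is named "nvidia".
installed="$(modinfo -F version nvidia 2>/dev/null || true)"

if [ -z "$installed" ]; then
    echo "nvidia kernel module not found"
elif version_ge "$installed" "$MIN_VERSION"; then
    echo "driver $installed is at or above $MIN_VERSION"
else
    echo "driver $installed is older than $MIN_VERSION"
fi
```

If the reported version is older than 352.79, the cuda-drivers package probably pulled in an earlier R352 build, and upgrading it is the thing to try first.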