Kernel NULL pointer dereference (Ubuntu 16.04.4, Tesla K80, driver version 375.51)

Hi, I’m experimenting with CUDA on a Microsoft Azure VM with a single Tesla K80 GPU, running Ubuntu 16.04 with kernel 4.4.0-75-generic. After restarting the server (or stopping and starting it), dmesg outputs the following:

[   29.490275] nvidia: module license 'NVIDIA' taints kernel.
[   29.490278] Disabling lock debugging due to kernel taint
[   29.494453] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[   29.499355] nvidia 25bed32b:00:00.0: can't derive routing for PCI INT A
[   29.499357] nvidia 25bed32b:00:00.0: PCI INT A: no GSI
[   29.501669] nvidia-nvlink: Nvlink Core is being initialized, major device number 246
[   29.501679] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  375.51  Wed Mar 22 10:26:12 PDT 2017 (using threaded interrupts)
[   29.563154] EXT4-fs (sdc1): mounted filesystem with ordered data mode. Opts: (null)
[   29.633299] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  375.51  Wed Mar 22 09:00:58 PDT 2017
[   29.634479] [drm] [nvidia-drm] [GPU ID 0xd32b0000] Loading driver
[   29.751281] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 245
[   29.775289] BUG: unable to handle kernel NULL pointer dereference at 0000000000000340
[   29.798452] IP: [<ffffffffc075549f>] _nv011745rm+0x1f/0x170 [nvidia]
[   29.798452] PGD e9c20b067 PUD e9c20c067 PMD 0
[   29.798452] Oops: 0000 [#1] SMP
[   29.798452] Modules linked in: nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt pci_hyperv i2c_piix4 8250_fintek joydev input_leds mac_hid serio_raw hv_balloon ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic hv_netvsc hv_storvsc scsi_transport_fc hid_hyperv hid hv_utils hyperv_keyboard crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper hyperv_fb cryptd psmouse pata_acpi hv_vmbus floppy fjes
[   29.798452] CPU: 5 PID: 931 Comm: nvidia-persiste Tainted: P           OE   4.4.0-75-generic #96-Ubuntu
[   29.798452] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006  01/06/2017
[   29.798452] task: ffff880e97dd8000 ti: ffff880e9cedc000 task.ti: ffff880e9cedc000
[   29.798452] RIP: 0010:[<ffffffffc075549f>]  [<ffffffffc075549f>] _nv011745rm+0x1f/0x170 [nvidia]
[   29.798452] RSP: 0018:ffff880e9cedf9d0  EFLAGS: 00010282
[   29.798452] RAX: 0000000000000000 RBX: ffff880e977dc008 RCX: ffff880e9d322f44
[   29.798452] RDX: 0000000000000008 RSI: 0000000000000000 RDI: ffff880e977dc008
[   29.798452] RBP: ffff880e9d322f40 R08: ffffffffc0e70cb0 R09: ffff880e9caa2008
[   29.798452] R10: ffff880e9caa2000 R11: ffffffffc0915c80 R12: 0000000000000000
[   29.798452] R13: ffff880e9216c008 R14: ffff880e9c5a2008 R15: ffff880e9b2cc008
[   29.798452] FS:  00007fc653648700(0000) GS:ffff880ea5740000(0000) knlGS:0000000000000000
[   29.798452] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   29.798452] CR2: 0000000000000340 CR3: 0000000e9ce0c000 CR4: 00000000001406e0
[   29.798452] Stack:
[   29.798452]  ffff880e977dc008 ffff880e977dc008 ffff880e9216c008 ffffffffc0748847
[   29.798452]  0000000000000000 ffffffffc07485c0 0000000000000000 ffff880e9216c008
[   29.798452]  ffff880e977dc008 0000000000000000 ffff880e9c5a2008 ffffffffc09802c7
[   29.798452] Call Trace:
[   29.798452]  [<ffffffffc0748847>] ? _nv011904rm+0x5f7/0x690 [nvidia]
[   29.798452]  [<ffffffffc07485c0>] ? _nv011904rm+0x370/0x690 [nvidia]
[   29.798452]  [<ffffffffc09802c7>] ? _nv012143rm+0x117/0x290 [nvidia]
[   32.200006]  [<ffffffffc0a24897>] ? _nv017616rm+0x327/0x480 [nvidia]
[   32.200006]  [<ffffffffc0a2606b>] ? _nv000800rm+0xeb/0x6e0 [nvidia]
[   32.200006]  [<ffffffffc0a1a098>] ? rm_init_adapter+0x128/0x130 [nvidia]
[   32.200006]  [<ffffffff810ac515>] ? wake_up_process+0x15/0x20
[   32.200006]  [<ffffffffc04684fd>] ? nv_open_device+0x12d/0x6d0 [nvidia]
[   32.200006]  [<ffffffffc0468d8d>] ? nvidia_open+0x14d/0x2f0 [nvidia]
[   32.200006]  [<ffffffffc0467328>] ? nvidia_frontend_open+0x58/0xa0 [nvidia]
[   32.200006]  [<ffffffff8121384f>] ? chrdev_open+0xbf/0x1b0
[   32.200006]  [<ffffffff8120c96f>] ? do_dentry_open+0x1ff/0x310
[   32.200006]  [<ffffffff81213790>] ? cdev_put+0x30/0x30
[   32.200006]  [<ffffffff8120db04>] ? vfs_open+0x54/0x80
[   32.200006]  [<ffffffff8121983b>] ? may_open+0x5b/0xf0
[   32.200006]  [<ffffffff8121d6b7>] ? path_openat+0x1b7/0x1330
[   32.200006]  [<ffffffff8121fa21>] ? do_filp_open+0x91/0x100
[   32.200006]  [<ffffffff8122d326>] ? __alloc_fd+0x46/0x190
[   32.200006]  [<ffffffff8120ded8>] ? do_sys_open+0x138/0x2a0
[   32.200006]  [<ffffffff8120e05e>] ? SyS_open+0x1e/0x20
[   32.200006]  [<ffffffff8183b972>] ? entry_SYSCALL_64_fastpath+0x16/0x71
[   32.200006] Code: 90 90 90 90 90 90 90 90 90 90 90 90 41 55 ba 08 00 00 00 41 54 53 48 83 ed 08 4c 8b a7 08 1e 00 00 48 89 fb 48 8d 4d 04 4c 89 e6 <41> ff 94 24 40 03 00 00 85 c0 be 00 00 56 00 75 30 0f b6 45 04
[   32.200006] RIP  [<ffffffffc075549f>] _nv011745rm+0x1f/0x170 [nvidia]
[   32.200006]  RSP <ffff880e9cedf9d0>
[   32.200006] CR2: 0000000000000340
[   32.201462] ---[ end trace 755709bc36b69e29 ]---

nvidia-bug-report.log (185 KB)

Hi df_670,
Please upload the nvidia bug report somewhere else as soon as the issue hits so I can download it. Does the same issue reproduce on a bare-metal OS? What CUDA version and application are you running? Can you share reproduction steps for this issue? Which Azure VM instance from https://azure.microsoft.com/en-in/blog/azure-n-series-preview-availability/ are you using: NC6, NC12, or NC24?

Hey Sandip,

For the bug report, I tried running the script, but it hung. I tried uploading whatever it had generated, but for some reason the file-scanning software on this forum thinks it’s infected and keeps refusing the upload. Is there any other way I can get you the file?

I’m using nvidia-docker to run an image built on top of tensorflow-gpu on an NC6 instance in the East US region of Microsoft Azure.
The first time I ran the container, it worked fine. However, after shutting down the VM (without explicitly stopping the container), the NVIDIA processes would always come back stuck in uninterruptible sleep (D state):

ps aux | grep nvidia

root        935  0.0  0.0  16488  4056 ?        D    May04   0:00 /usr/bin/nvidia-smi
nvidia-+    953  0.0  0.0  17100  1740 ?        Ds   May04   0:00 /usr/bin/nvidia-persistenced --user nvidia-persistenced
nvidia-+   1368  0.0  0.0 116364  9548 ?        Dsl  May04   0:00 /usr/bin/nvidia-docker-plugin -s /var/lib/nvidia-docker
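For what it’s worth, here is a rough sketch of how to see where those D-state processes are blocked in the kernel; reading /proc/<pid>/stack needs root and a kernel built with stack traces, so treat that part as an assumption about this setup:

# List processes in uninterruptible sleep together with their kernel wait channel
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

# Dump the kernel stack of each stuck process (assumes /proc/<pid>/stack is available)
for pid in $(ps -eo pid,stat | awk '$2 ~ /^D/ {print $1}'); do
    echo "=== $pid ==="
    sudo cat /proc/$pid/stack
done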

I followed these steps to set up the environment:

  1. Install the NVIDIA driver:
DRIVER_VERSION="375.51"
wget http://us.download.nvidia.com/tesla/${DRIVER_VERSION}/nvidia-driver-local-repo-ubuntu1604_${DRIVER_VERSION}-1_amd64.deb

sudo dpkg -i nvidia-driver-local-repo-ubuntu1604_${DRIVER_VERSION}-1_amd64.deb
sudo apt-get update
sudo apt-get install -y cuda-drivers

rm nvidia-driver-local-repo-ubuntu1604_${DRIVER_VERSION}-1_amd64.deb

I didn’t reboot the machine before going to the next steps, though (in case it matters).
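(A sanity check that wasn’t part of my original steps, sketched here for completeness: confirming the driver actually loaded before moving on.)

# Verify the kernel modules are loaded and the driver responds
lsmod | grep nvidia
cat /proc/driver/nvidia/version
nvidia-smi    # should list the Tesla K80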

  2. Install Docker
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88

sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"

sudo apt-get update
sudo apt-get install -y docker-ce
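(Another check I didn’t run at the time, just a sketch: make sure plain Docker works before adding the NVIDIA bits.)

# Confirm the Docker daemon is up and can run a non-GPU container
sudo systemctl status docker --no-pager
sudo docker run --rm hello-world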
  3. Install nvidia-docker
LATEST_VERSION="1.0.1"
wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v${LATEST_VERSION}/nvidia-docker_${LATEST_VERSION}-1_amd64.deb
sudo dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb
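(The usual nvidia-docker smoke test would fit here; again only a sketch, not something from my original setup.)

# Confirm the nvidia-docker plugin service is running and a CUDA container can see the GPU
sudo systemctl status nvidia-docker --no-pager
nvidia-docker run --rm nvidia/cuda nvidia-smi    # should show the K80, same as on the host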
  4. Pull the image and run it
nvidia-docker pull spatialcomputing/deep-learning-env-gpu
nvidia-docker run --rm -ti -v /a-mounted-drive-name:/drive-name spatialcomputing/deep-learning-env-gpu
  5. Run some TensorFlow jobs (we use Keras on top of TF, if it matters).
  6. Shut down the machine without stopping the container.
  7. Hopefully you’ll be able to reproduce the bug? (A condensed sketch of steps 5-7 follows below.)
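To make steps 5-7 concrete, here is how the sequence might look as commands. The container name, the tiny TensorFlow one-liner, and using poweroff are my assumptions for illustration, not exactly what I ran:

# Start the container detached, run a small job that initializes the GPU,
# then power the VM off without stopping the container first.
nvidia-docker run -d --name tf-repro spatialcomputing/deep-learning-env-gpu \
    python -c "import tensorflow as tf; print(tf.Session().run(tf.constant('gpu init ok')))"
sudo poweroff    # or stop/start the VM from the Azure portal

# After the VM comes back up, look for the oops and the stuck processes:
dmesg | grep -i -A 3 "NULL pointer"
ps aux | awk '$8 ~ /^D/'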

We only have access to a single NC6 instance, so unfortunately I can’t really spin up a new machine and test whether this bug is reproducible.