Torch, Ubuntu 14.04.3, CUDA 7.5.18, NVIDIA 352.39, Linux 3.19.0-37 and greater kernel faults

Hi… I’ve been posting about this here: https://goo.gl/h8XGEi

But since I’ve finally narrowed it down, I figured that if I posted it here, someone at NVIDIA might be able to debug it further than I’ve been able to.

So, this is my configuration: nvidia-smi -L lists

GPU 0: Quadro M6000 (UUID: GPU-09446504-6a9e-866a-a65d-0f1d55b7657b)
GPU 1: Tesla K40c (UUID: GPU-4d14695e-3e43-bf43-a3e3-91190f696d39)
GPU 2: Tesla K40c (UUID: GPU-e992022a-724f-8f47-e08f-a954053020e6)

I started with Ubuntu Server 14.04.3; my uname -a shows

Linux gpu 3.19.0-41-generic #46~14.04.2-Ubuntu SMP Tue Dec 8 17:46:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

and I installed CUDA with cuda_7.5.18_linux.run and NVIDIA driver version 352.39 (though I’ve downloaded later driver versions and the error persists with them), with Torch installed from GitHub.

With all that, if I run this script:

#! /usr/local/src/torch/install/bin/th

require "cutorch"

it never returns. The kern.log, though, shows a kernel fault:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000020

I’ve linked the full kernel dump here: http://cablemodem.hex21.com/~binesh/kern.log.
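In case it helps anyone reproduce this, here’s roughly what I do to watch it from a second shell (just a sketch; the log path assumes a stock Ubuntu syslog setup):

# Run the script above in one shell; it hangs on require "cutorch".
# Then, from a second shell:
ps aux | grep '[t]h'            # the th process is stuck and never exits
tail -n 200 /var/log/kern.log   # the NULL pointer dereference oops lands here
dmesg | grep -i "unable to handle kernel NULL pointer"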

My first step was to rule out hardware issues, so I pulled each card out one by one, and the bug only happens when all three cards are installed. The M6000 alone, or the M6000 with either one of the Teslas, doesn’t cause a kernel fault; only all three together do. I then decided to reinstall Ubuntu Server 14.04.3, and noticed that Torch then runs that script (which isn’t doing anything real, it’s simply requiring “cutorch”) without any issue.

But if I then run apt-get update; apt-get dist-upgrade and bring the Linux kernel back up to 3.19.0-41, I get the kernel faults again.

So that narrowed it down to somewhere between 3.19.0-25, which is fine, and 3.19.0-41, which kernel faults.

A binary search between 25 and 41 finally showed that Linux 3.19.0-33 works, whereas the next available version, 3.19.0-37, fails.

Unfortunately, I don’t know enough to debug this further, but I’m really hoping someone from NVIDIA will be able to verify my problem or dig into it more. I’d really like to be able to upgrade my Ubuntu to the latest version. (Although, at least now that I have this narrowed down, I can upgrade and then downgrade only the Linux kernel, as sketched below…)
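For what it’s worth, here’s roughly what I mean by downgrading only the kernel (a sketch; the package names assume the lts-vivid 3.19 HWE series that 14.04.3 ships with, so adjust as needed):

# Install the last known-good kernel and its matching headers.
sudo apt-get update
sudo apt-get install linux-image-3.19.0-33-generic linux-headers-3.19.0-33-generic

# Keep apt-get dist-upgrade from pulling the kernel past 3.19.0-33.
sudo apt-mark hold linux-image-generic-lts-vivid linux-headers-generic-lts-vivid

# Reboot into 3.19.0-33 and confirm.
uname -r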

So… that’s about it. Has anyone else seen this issue? Or is my configuration so unique that it’s a problem only for me? In any case, I’m posting it here so someone might be able to dig further. Thanks!

Here’s an experiment I would try.

  1. Start with a clean load of Ubuntu.
  2. Before installing any NVIDIA software at all, update your kernel.
  3. Install the NVIDIA 352.68 driver using the driver runfile method, i.e., get the driver from here:

http://www.nvidia.com/download/driverResults.aspx/96762/en-us

  4. Install the CUDA 7.5 toolkit using the runfile installer method, selecting “no” when it prompts to install the bundled driver (352.39). In other respects, follow the installation guide linked below.
  5. Repeat your test.

If that method still produces the same observations, I would file a bug at developer.nvidia.com. Note that Step 3 above will require that the appropriate kernel headers are installed for the new kernel loaded in step 2.
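Roughly, steps 2 through 5 from the command line would look something like this (a sketch; the file names are the ones mentioned in this thread, so substitute whatever you actually download):

# Step 2: update the kernel first, then reboot into it.
sudo apt-get update && sudo apt-get dist-upgrade
sudo reboot

# Install headers matching the now-running kernel so the driver runfile
# can rebuild its kernel modules (needed for step 3).
sudo apt-get install linux-headers-$(uname -r)

# Step 3: install the 352.68 driver from its runfile.
sudo sh NVIDIA-Linux-x86_64-352.68.run

# Step 4: install the CUDA 7.5 toolkit from its runfile, answering "no"
# when it offers to install the bundled 352.39 driver.
sudo sh cuda_7.5.18_linux.run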

There are several concerns here:

  1. When CUDA 7.5 on Linux indicates that it is supported on Ubuntu 14.04, it means with the default kernel for that distro. In some cases, moving to a newer kernel may introduce an issue that was not caught during QA testing of the CUDA 7.5 package on Linux. So you’re in uncharted territory, and in the general case, arbitrary updates to the Linux kernel are not a supported configuration for CUDA.

  2. It’s not clear from your description whether you are doing this or not, but it seems to be the case: if you install the NVIDIA driver and then update to an arbitrary newer kernel, that can definitely break things. The driver can be installed in two different ways: the package method, which generally involves loading specially built, pre-compiled driver interfaces to the kernel, and the runfile installer method, which re-creates (re-compiles) the needed kernel interfaces when you run the installer. If the underlying interface headers change from one Linux kernel to the next, this can invalidate a previous driver install. By using the runfile installer method (after installation of the new kernel), we re-create the needed interfaces for the new kernel; a quick way to check for such a mismatch is sketched below.
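As a quick sanity check (a sketch; the module name assumes a runfile-style install), you can compare the running kernel against what the installed NVIDIA module was built for:

uname -r                            # kernel you are actually running
modinfo nvidia | grep -i vermagic   # kernel the nvidia module was compiled against
cat /proc/driver/nvidia/version     # loaded driver version (missing if the module failed to load)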

So, in the final analysis, a 3.19.0 kernel is not officially supported on Ubuntu 14.04, as is evident from the CUDA 7.5 installation guide:

Installation Guide Linux :: CUDA Toolkit Documentation

so your mileage may vary. But before we even consider that, it’s possible you’ve invalidated the driver install by updating the kernel after the driver was installed. The above method should at least fix that issue and remove it from consideration.

Finally, a 3.19.0 kernel is officially supported on Ubuntu 15.04, so you might try that. (Again, kernel updates may present an issue.)

Oh, I think I wasn’t clear when I said “I started using Ubuntu Server 14.04.3”.

Although initially I had an Ubuntu install that I had been using for several months, when this problem started happening I made a backup of my system, and everything I reported above was done on a completely fresh install of Ubuntu Server 14.04.3. So my tests are all based on fresh loads of Ubuntu, not ones that I’ve been using forever. At this stage, I’ve actually reinstalled Ubuntu about a dozen times on this same box.

Hi,
thanks a lot for this post. We have exactly the same issue:

  • 3x GeForce GTX TITAN Black
  • Clean installation of Ubuntu 14.04.3 LTS.
  • With the 3.19.0-42-generic kernel, Torch applications using CUDA hang at exit in the kernel (zombie process). The reboot command always hangs; a hard reset is required. We tried different CUDA versions (6.5, 7.0, 7.5) and different NVIDIA drivers (from the Ubuntu repo, from the NVIDIA website), all resulting in the same problem.
  • With 3.19.0-33-generic, no problems at all: NVIDIA driver 352.68 from the Ubuntu repos, with both CUDA 7.0 and 7.5.

I guess it’s an NVIDIA kernel driver issue. I’m able to replicate the problem if someone is interested in logs/debug output.

Filing a bug via the bug reporting form linked from the CUDA registered developer website seems like the best course of action. Multiple people filing bugs won’t hurt anything; it is entirely possible that what looks like the same issue initially may actually trace back to a different root cause.
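When you file, it generally helps to attach the oops from the kernel log plus the bundle produced by nvidia-bug-report.sh (which ships with the driver); something along these lines:

# Capture the kernel oops and NVIDIA's diagnostic bundle to attach to the bug report.
dmesg | grep -i -A 25 "unable to handle kernel NULL pointer" > oops.txt
sudo nvidia-bug-report.sh   # writes nvidia-bug-report.log.gz in the current directory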

I can confirm the same issue with the K80. Only 3.19.0-33-generic works without a kernel oops (for example, when running the p2pBandwidth test from the CUDA samples).
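For reference, this is roughly how I run it (assuming the samples were installed to their default location by the CUDA 7.5 runfile installer):

cd ~/NVIDIA_CUDA-7.5_Samples/1_Utilities/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest   # on the bad kernels this is enough to trigger the oops/hang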

@frosenberg

This issue has already been fixed in the newest driver versions from the R352 (or R361) driver family.
Could you please verify it again with a newer driver (say, v352.79)? Thanks.
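After updating, you can confirm which driver is actually loaded with, for example:

cat /proc/driver/nvidia/version
nvidia-smi --query-gpu=driver_version --format=csv,noheader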