Hi… I’ve been posting about this here: https://goo.gl/h8XGEi
But now that I’ve finally narrowed it down, I figured that if I posted it here, someone at NVIDIA might be able to debug it further than I’ve been able to.
So, this is my configuration: nvidia-smi -L lists
GPU 0: Quadro M6000 (UUID: GPU-09446504-6a9e-866a-a65d-0f1d55b7657b)
GPU 1: Tesla K40c (UUID: GPU-4d14695e-3e43-bf43-a3e3-91190f696d39)
GPU 2: Tesla K40c (UUID: GPU-e992022a-724f-8f47-e08f-a954053020e6)
I started with Ubuntu Server 14.04.3; my uname -a shows
Linux gpu 3.19.0-41-generic #46~14.04.2-Ubuntu SMP Tue Dec 8 17:46:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
I installed CUDA with cuda_7.5.18_linux.run and NVIDIA driver version 352.39 (though I’ve also downloaded later versions and the error persists with them), and installed torch from GitHub.
With all that, if I run this script:
#! /usr/local/src/torch/install/bin/th
require "cutorch"
it never returns. The kern.log, though, shows a kernel fault:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
I’ve linked the full kernel dump here: http://cablemodem.hex21.com/~binesh/kern.log.
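For anyone who wants to check their own log for the same fault, here is a minimal Python sketch that pulls the faulting address out of lines like the one above. The function name find_null_derefs is just something I made up for illustration, and the /var/log/kern.log path in the comment is an assumption about where the log lives on your system.

```python
import re

# Matches the kernel message quoted above and captures the faulting address.
FAULT = re.compile(
    r"BUG: unable to handle kernel NULL pointer dereference at ([0-9a-f]+)"
)

def find_null_derefs(lines):
    """Return the faulting addresses from kernel NULL-dereference BUG lines."""
    hits = []
    for line in lines:
        match = FAULT.search(line)
        if match:
            hits.append(match.group(1))
    return hits

# Sample input: the exact line from my kern.log.
sample = [
    "BUG: unable to handle kernel NULL pointer dereference at 0000000000000020",
]
print(find_null_derefs(sample))

# On a real machine you might instead do (path is an assumption):
#     with open("/var/log/kern.log") as log:
#         print(find_null_derefs(log))
```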
My first step was to rule out hardware issues, so I pulled the cards one by one. The bug only happens if I have all three cards in: the M6000 alone, or the M6000 with either one of the Teslas, doesn’t cause a kernel fault. Only all three together do. Then I decided to reinstall Ubuntu Server 14.04.3, and noticed that torch then runs that script (which isn’t doing anything real, it’s simply requiring “cutorch”) without any issue.
But, if I then run apt-get update; apt-get dist-upgrade, and bring the linux kernel back up to 3.19.0-41, then I get the kernel faults again.
So, that narrowed it down to somewhere between 3.19.0-25, which is fine, and 3.19.0-41, which kernel faults.
A binary search between 25 and 41 finally showed that Linux 3.19.0-33 works, whereas the next available version, 3.19.0-37, fails again.
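In case it helps anyone repeat the search, this is roughly the logic I followed, written as a small Python sketch. The function name bisect_kernels is made up for illustration, and the is_bad check here is only a stand-in that replays the results I reported; in reality each "test" meant booting that kernel and running the script by hand. The list contains only the ABI numbers I’ve mentioned in this post, not every kernel Ubuntu published in between.

```python
def bisect_kernels(versions, is_bad):
    """Return (last_good, first_bad) from a sorted list of versions.

    Assumes versions[0] is known good, versions[-1] is known bad, and that
    once a version is bad every later version is bad too.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] good, versions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # fault reproduced: the regression is at mid or earlier
        else:
            lo = mid  # no fault: the regression is after mid
    return versions[lo], versions[hi]

# Only the 3.19.0-NN ABI numbers named in this post.
known = [25, 33, 37, 41]
# Stand-in for "boot this kernel and run the cutorch script".
reported_bad = {37, 41}

last_good, first_bad = bisect_kernels(known, lambda v: v in reported_bad)
print("last good: 3.19.0-%d, first bad: 3.19.0-%d" % (last_good, first_bad))
```

With the results above this lands on 3.19.0-33 as the last good kernel and 3.19.0-37 as the first bad one, so whatever changed is somewhere between those two.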
Unfortunately, I don’t know enough to be able to debug further, but I’m really hoping someone from NVIDIA will be able to verify my problem or dig into it more. I’d really like to be able to upgrade my Ubuntu to the latest version. (Although, at least now that I know this, I can upgrade, and then downgrade only the Linux kernel…)
So… That’s about it… Actually… Has anyone else seen this issue? Or is my configuration so unique that it’s a problem only for me? In any case, I’m posting it here so someone might be able to dig further. Thanks!