Every time I load NVIDIA driver 375.51 or 375.66, the P100 goes offline after about 8 minutes.
[jc@nvidia ~]$ nvidia-smi -L
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-fbe9360d-e693-28a2-c055-20ecfd2857d8)
[jc@nvidia ~]$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.66 Mon May 1 15:29:16 PDT 2017
GCC version: gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC)
[jc@nvidia ~]$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
PROBLEM:
After about 8 minutes online, the P100 card stops responding:
[jc@nvidia ~]$ nvidia-smi -L
Unable to determine the device handle for gpu 0000:03:00.0: Unknown Error
Any advice appreciated.
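To pin down exactly when the card drops out after the driver loads, a small polling loop can timestamp the first failed query. This is only a minimal sketch, not an NVIDIA tool: `gpu_alive` and `watch_gpu` are hypothetical helper names, and the `PROBE_CMD` override and the polling interval are assumptions. Only `nvidia-smi -L` comes from the post above.

```shell
# Sketch: poll the GPU and report how long it stayed reachable.
# PROBE_CMD is a hypothetical override so the loop can be tried without a GPU.
PROBE_CMD="${PROBE_CMD:-nvidia-smi -L}"

# Succeeds only if the probe exits 0 and lists at least one "GPU n:" line.
gpu_alive() {
    local out
    out="$($PROBE_CMD 2>&1)" || return 1
    printf '%s\n' "$out" | grep -q '^GPU [0-9][0-9]*:'
}

# watch_gpu [INTERVAL_SECONDS] [MAX_POLLS]
# Prints one line when the GPU stops answering, then returns 1.
# Returns 0 if MAX_POLLS probes all succeed (0 means poll forever).
watch_gpu() {
    local interval="${1:-10}" max="${2:-0}" polls=0 start now
    start="$(date +%s)"
    while gpu_alive; do
        polls=$((polls + 1))
        if [ "$max" -gt 0 ] && [ "$polls" -ge "$max" ]; then
            echo "GPU still reachable after $polls polls"
            return 0
        fi
        sleep "$interval"
    done
    now="$(date +%s)"
    echo "GPU unreachable after $polls successful polls ($((now - start))s) at $(date '+%F %T')"
    return 1
}
```

Calling `watch_gpu 10` right after the driver loads would print the elapsed time once `nvidia-smi -L` stops listing the card, which should make it easier to compare driver versions.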
Does the fan work? Maybe it’s overheating and shutting down.
Run nvidia-bug-report.sh and attach output to post.
Yes, the fans are working normally. The system works fine with driver 375.36 and earlier.
As far as I can tell, the problem is specific to the newer drivers (375.51+): whenever we run 375.51 or 375.66, the card goes into this unresponsive state about 6 minutes after the driver update. Here are some details of my system:
Manufacturer: Supermicro
Product Name: SYS-7048GR-TR
Note: I have checked the temperature and verified that all chassis fans are operating, using the management interface:
FAN1 Normal 2200 R.P.M
FAN2 Normal 3500 R.P.M
FAN3 Normal 4700 R.P.M
FAN4 Normal 4700 R.P.M
FAN5 Normal 1900 R.P.M
FAN6 Normal 1900 R.P.M
FANA Normal 2700 R.P.M
FANB Normal 1100 R.P.M
FANC Normal 2300 R.P.M
FAND Normal 2300 R.P.M
FAN7 Normal 5700 R.P.M
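Chassis fan readings do not show the temperature the GPU itself reports, so it may help to log the driver's own readings until the card drops out. A hedged sketch: `timestamp`, `temperature.gpu`, and `power.draw` are standard `nvidia-smi --query-gpu` fields, but `log_gpu_temp`, `QUERY_CMD`, and the log file argument are hypothetical names chosen for illustration. The loop also assumes `nvidia-smi` exits with a non-zero status once the device is lost.

```shell
# QUERY_CMD is a hypothetical override so the loop can be exercised without a GPU.
QUERY_CMD="${QUERY_CMD:-nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw --format=csv,noheader}"

# log_gpu_temp LOGFILE [INTERVAL_SECONDS]
# Appends one CSV row per poll until the query fails or returns nothing,
# then prints the last good row.
log_gpu_temp() {
    local logfile="$1" interval="${2:-5}" row last=""
    while row="$($QUERY_CMD 2>/dev/null)" && [ -n "$row" ]; do
        printf '%s\n' "$row" >> "$logfile"
        last="$row"
        sleep "$interval"
    done
    echo "last reading before failure: ${last:-none}"
}
```

If the last logged temperature is well within limits when the card disappears, overheating becomes a less likely explanation.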
COMMAND LINE DIAGNOSTICS:
[jmarosz@nvidia6 /tmp]$ sudo sh /usr/bin/nvidia-bug-report.sh
nvidia-bug-report.sh will now collect information about your
system and create the file ‘nvidia-bug-report.log.gz’ in the current
directory. It may take several seconds to run. In some
cases, it may hang trying to capture data generated dynamically
by the Linux kernel and/or the NVIDIA kernel module. While
the bug report log file will be incomplete if this happens, it
may still contain enough data to diagnose your problem.
Please include the ‘nvidia-bug-report.log.gz’ log file when reporting
your bug via the NVIDIA Linux forum (see devtalk.nvidia.com)
or by sending email to ‘linux-bugs@nvidia.com’.
Running nvidia-bug-report.sh.../usr/bin/nvidia-bug-report.sh: line 479: 3872 Segmentation fault (core dumped) $lspci -d "10de:*" -v -xxx 2> /dev/null
Failed to look up boot -1: Cannot assign requested address
Failed to look up boot -2: Cannot assign requested address
If the bug report script hangs after this point consider running with
--safe-mode command line argument.
complete.
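When the driver loses a GPU, the kernel log usually carries `NVRM:` lines, often with an `Xid` code, that narrow down the cause more than `nvidia-smi` does. Below is a minimal sketch for pulling those lines out of kernel log text. `nvrm_events` and `count_xid` are hypothetical helper names, and feeding them from `dmesg` assumes the kernel ring buffer still holds the messages from the failure.

```shell
# nvrm_events: read kernel log text on stdin, print only NVIDIA driver messages.
nvrm_events() {
    grep -E 'NVRM|Xid' || true
}

# count_xid: read kernel log text on stdin, print how many lines mention an Xid event.
count_xid() {
    grep -c 'Xid' || true
}
```

For example, `dmesg | nvrm_events` run shortly after the card goes offline would show what the driver logged at that moment. Since `lspci` crashed during collection above, rerunning the report script with the `--safe-mode` argument it suggests may also be worth trying.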
By ‘output’ I meant the resulting nvidia-bug-report.log.gz file.
Do the 378-series drivers also work? Edit: I meant the 381 series.
I’ll redo things momentarily.