Issue with Cuda 8 and python on Linux (Ubuntu 16.04.03, kernel 13.1)

Hi.
I’m having problems with CUDA and Python (PyTorch). Running a script usually results in one of three errors: CUDA runtime error 4, CUDNN_STATUS_INTERNAL_ERROR, or a segmentation fault. However, the first time I run a script after booting, it works fine. I’ve checked with nvidia-smi that the memory is freed after the script terminates.
deviceQuery and bandwidthTest from the provided CUDA samples run fine and both report “PASS”.
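
For readers who want to repeat the memory check without reading nvidia-smi output by eye, here is a minimal sketch that queries used and total memory per GPU. It assumes nvidia-smi is on the PATH; the function names `parse_memory_csv` and `query_gpu_memory` are hypothetical helpers, not part of any NVIDIA tool.

```python
import shutil
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' CSV lines (MiB, no header, no units) into tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...] per GPU, or None if the query fails."""
    if shutil.which("nvidia-smi") is None:
        return None  # driver tools not installed on this machine
    try:
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            stdout=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return None  # nvidia-smi could not talk to the driver
    return parse_memory_csv(result.stdout)


if __name__ == "__main__":
    print(query_gpu_memory())
```

Running this before and after a script shows whether used memory returns to its idle level once the process exits.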

Below, each error is followed by the kernel log entries recorded when it occurred:

Error 1:
RuntimeError: cuda runtime error (4) : unspecified launch failure at /home/ric/ptDocker/pytorch/torch/lib/THC/generic/THCTensorCopy.c:18

[ 1204.635697] NVRM: GPU Board Serial Number:
[ 1204.635702] NVRM: Xid (PCI:0000:23:00): 69, Class Error: ChId 0028, Class 0000c1c0, Offset 00001b00, Data 00040002, ErrorCode 0000000c

Error 2:
torch.backends.cudnn.CuDNNError: 4: b'CUDNN_STATUS_INTERNAL_ERROR'

[ 1570.852521] NVRM: Xid (PCI:0000:23:00): 31, Ch 00000028, engmask 00000101, intr 10000000

Error 3:
Segmentation fault (core dumped)

[ 2521.280234] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000029 Cl 0000c1c0 Off 000001b8 Data 04095b00
[ 2521.282631] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000028 Cl 0000c1c0 Off 000001bc Data 00000102
[ 2521.282779] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000028 Cl 0000c1c0 Off 000001c0 Data 00000020
[ 2521.284353] python[4719]: segfault at 500000004 ip 00007f7ebfd71815 sp 00007ffe0a5a6cb8 error 4 in libcuda.so.375.66[7f7ebfbe6000+6c2000]
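
When comparing several failed runs, it can help to pull just the Xid codes out of the kernel log. The following is a rough sketch that assumes log lines in the `NVRM: Xid (PCI:...): <code>, ...` format shown above; the helper names `parse_xid_events` and `count_xid_codes` are made up for illustration, and the sample text is copied from the logs in this post.

```python
import re
from collections import Counter

# Matches e.g. "NVRM: Xid (PCI:0000:23:00): 69, Class Error: ..."
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9A-Fa-f:.]+)\): (\d+),")


def parse_xid_events(log_text):
    """Return a list of (pci_address, xid_code) tuples found in the log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


def count_xid_codes(log_text):
    """Count how many times each Xid code appears."""
    return Counter(code for _, code in parse_xid_events(log_text))


SAMPLE = """\
[ 1204.635697] NVRM: GPU Board Serial Number:
[ 1204.635702] NVRM: Xid (PCI:0000:23:00): 69, Class Error: ChId 0028, Class 0000c1c0, Offset 00001b00, Data 00040002, ErrorCode 0000000c
[ 1570.852521] NVRM: Xid (PCI:0000:23:00): 31, Ch 00000028, engmask 00000101, intr 10000000
[ 2521.280234] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000029 Cl 0000c1c0 Off 000001b8 Data 04095b00
[ 2521.282631] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000028 Cl 0000c1c0 Off 000001bc Data 00000102
[ 2521.282779] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000028 Cl 0000c1c0 Off 000001c0 Data 00000020
"""

if __name__ == "__main__":
    for code, count in sorted(count_xid_codes(SAMPLE).items()):
        print("Xid", code, "seen", count, "time(s)")
```

The same functions can be fed the full contents of dmesg.txt. The meaning of each code is listed in NVIDIA's Xid error documentation.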

I haven’t experienced issues with my graphics output (other than minor freezing right after a script fails).

I’ve tried different versions of the Python package, and they all fail. I’ve also tried reinstalling CUDA and changing the driver (I tried the runfile installers for versions 375.66 and 384.69, but I’m currently using driver version 375.66 installed through Ubuntu’s package manager). I’ve also tried reinstalling the operating system from scratch and switching to an earlier kernel (10.0-3).

I’m attaching the contents of /proc/cpuinfo, the logs from a series of three tests (which generated the errors above), and the bug-report.log.

Any help would be greatly appreciated.

My system specifications are:
CPU: AMD Ryzen 1600
GPU: GTX 1080 (Zotac AMP!)
Chipset: B350 (MSI B350 PC Mate)
OS: Ubuntu 16.04.03, kernel: 13.1
Cuda: 8.0.61 + cuBLAS patch (runfile installation)
Nvidia driver: 375.66 (package manager installation)
cuDNN: 6.021
nvidia-bug-report.log.gz (133 KB)
cpu.txt (15.5 KB)
dmesg.txt (60.6 KB)

Could you provide a sample app or program that reproduces this issue? What other software components are needed to run it? How is PyTorch installed, if it is needed?

I am currently using Anaconda to manage Python and the corresponding packages. I use a conda environment with Python 3.5. To create an environment with the required components (using conda):

conda create -n <env_name> python=3.5
source activate <env_name>
conda install pytorch torchvision cuda80 -c soumith

Alternatively, PyTorch can be installed using pip:
pip3 install http://download.pytorch.org/whl/cu80/torch-0.2.0.post3-cp35-cp35m-manylinux1_x86_64.whl
pip3 install torchvision

I am attaching a very simple script that reproduces the errors intermittently. The script usually runs fine the first time and fails at some point when it is run again. Interrupting the script usually causes it to fail the next time it is run.
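
The attached script is not reproduced in this thread. For readers who want something to try, here is a minimal sketch of a comparable GPU workload. It is not the contents of crash.zip: the function name `stress_gpu`, the matrix size, and the iteration count are all illustrative assumptions, and it assumes PyTorch is installed with CUDA support.

```python
def stress_gpu(size=1024, iterations=100):
    """Run repeated matrix multiplications on the GPU and report a status string."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "cuda unavailable"

    # Allocate two random matrices on the host and copy them to the GPU.
    a = torch.randn(size, size).cuda()
    b = torch.randn(size, size).cuda()
    for _ in range(iterations):
        a = torch.mm(a, b)
        # Rescale so values stay bounded across iterations.
        a = a / a.abs().max()
    # Wait for all queued kernels so that launch failures surface here.
    torch.cuda.synchronize()
    return "ok"


if __name__ == "__main__":
    print(stress_gpu())
```

Running it several times in a row, and interrupting one of the runs, follows the failure pattern described above. A healthy system should print "ok" every time.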

However, after further testing I found that my CPU may be affected by a hardware issue present in some Ryzen processors that causes system instability under various conditions. This is likely related to the problems I’ve experienced with CUDA and Python.
I am currently in the process of replacing the CPU. I will update this topic as soon as I have tested the system with a new CPU.
crash.zip (915 Bytes)

Hi. I’ve replaced the CPU in my system, and after some testing I can say that the issues I was experiencing were indeed hardware-related. With the new CPU, CUDA has been working fine with my programs.
Thanks!

I encountered the Xid error when running KDE applications. It happens unexpectedly, and once it does, only the mouse cursor remains visible; everything else freezes.

In the log file, I can see the error:
Dec 03 08:45:41 gentoo-amd kernel: NVRM: GPU at PCI:0000:42:00: GPU-f57007b3-0a41-f62d-67dd-8648df008e8a
Dec 03 08:45:41 gentoo-amd kernel: NVRM: GPU Board Serial Number:
Dec 03 08:45:41 gentoo-amd kernel: NVRM: Xid (PCI:0000:42:00): 31, Ch 00000020, engmask 00000101, intr 10000000

From searching on Google, I found many people saying this is a driver issue. Has NVIDIA reproduced the error, and can it provide a fix?