Issue with Cuda 8 and python on Linux (Ubuntu 16.04.03, kernel 13.1)

ricardocuenk · September 20, 2017, 5:15am

Hi.
I’m having problems with CUDA and Python (PyTorch). Running a script usually results in one of three errors: CUDA runtime error 4, CUDNN_STATUS_INTERNAL_ERROR, or a segmentation fault. However, the first time I execute a script after booting, it runs fine. I’ve checked with nvidia-smi that the memory is freed after the script terminates.
deviceQuery and bandwidthTest from the provided CUDA samples run fine and both report “PASS”.
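
For anyone who wants to script the memory check rather than read nvidia-smi by eye, here is a minimal sketch. It assumes nvidia-smi is on the PATH and that its CSV query output has one integer (MiB) per GPU; the function names are hypothetical and not part of any library.

```python
import subprocess


def parse_memory_used(csv_text):
    """Parse one integer (MiB) per non-empty line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def query_memory_used():
    """Ask nvidia-smi for the used memory of each GPU, in MiB."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        universal_newlines=True,  # return text rather than bytes
    )
    return parse_memory_used(output)
```

Calling query_memory_used() before and after a script run gives two lists that can be compared to see whether the memory really returned to its baseline.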

Below are the three errors, each followed by the kernel log entries recorded when it occurred:

Error 1:
RuntimeError: cuda runtime error (4) : unspecified launch failure at /home/ric/ptDocker/pytorch/torch/lib/THC/generic/THCTensorCopy.c:18

[ 1204.635697] NVRM: GPU Board Serial Number:
[ 1204.635702] NVRM: Xid (PCI:0000:23:00): 69, Class Error: ChId 0028, Class 0000c1c0, Offset 00001b00, Data 00040002, ErrorCode 0000000c

Error 2:
torch.backends.cudnn.CuDNNError: 4: b’CUDNN_STATUS_INTERNAL_ERROR’

[ 1570.852521] NVRM: Xid (PCI:0000:23:00): 31, Ch 00000028, engmask 00000101, intr 10000000

Error 3:
Segmentation fault (core dumped)

[ 2521.280234] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000029 Cl 0000c1c0 Off 000001b8 Data 04095b00
[ 2521.282631] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000028 Cl 0000c1c0 Off 000001bc Data 00000102
[ 2521.282779] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000028 Cl 0000c1c0 Off 000001c0 Data 00000020
[ 2521.284353] python[4719]: segfault at 500000004 ip 00007f7ebfd71815 sp 00007ffe0a5a6cb8 error 4 in libcuda.so.375.66[7f7ebfbe6000+6c2000]
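
The Xid codes in the NVRM lines above can be pulled out of a saved dmesg dump with a short script. This is only a sketch: the regular expression is an assumption based on the line format shown in this post, and extract_xids is a hypothetical helper name.

```python
import re

# Matches kernel log lines such as:
# [ 1570.852521] NVRM: Xid (PCI:0000:23:00): 31, Ch 00000028, ...
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")


def extract_xids(log_text):
    """Return a list of (pci_address, xid_code) tuples found in log_text."""
    return [(m.group(1), int(m.group(2))) for m in XID_PATTERN.finditer(log_text)]


# Sample lines copied from the logs in this post.
sample = """\
[ 1204.635702] NVRM: Xid (PCI:0000:23:00): 69, Class Error: ChId 0028, Class 0000c1c0, Offset 00001b00, Data 00040002, ErrorCode 0000000c
[ 1570.852521] NVRM: Xid (PCI:0000:23:00): 31, Ch 00000028, engmask 00000101, intr 10000000
[ 2521.280234] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000029 Cl 0000c1c0 Off 000001b8 Data 04095b00
"""

print(extract_xids(sample))
```

Running it over the full dmesg.txt instead of the sample would show which Xid codes occur and how often, which makes it easier to compare runs.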

I haven’t experienced issues with my graphics output (other than minor freezing right after a script fails).

I’ve tried using different versions of the python package and they all fail. I’ve also tried reinstalling Cuda and changing the driver (I’ve tried using the runfile installations for versions 375.66 and 384.69, but currently I’m using driver version 375.66 installed through Ubuntu’s package manager). I’ve also tried reinstalling the operating system from scratch and switching to an earlier kernel (10.0-3).

I’m attaching the content of /proc/cpuinfo and the logs after a series of three tests (which generated the above errors), as well as the bug-report.log.

Any help would be greatly appreciated.

My system specifications are:
CPU: AMD Ryzen 1600
GPU: GTX 1080 (Zotac AMP!)
Chipset: B350 (MSI B350 PC Mate)
OS: Ubuntu 16.04.03, kernel: 13.1
Cuda: 8.0.61 + cuBLAS patch (runfile installation)
Nvidia driver: 375.66 (package manager installation)
cuDNN: 6.0.21
nvidia-bug-report.log.gz (133 KB)
cpu.txt (15.5 KB)
dmesg.txt (60.6 KB)

sandipt · September 22, 2017, 8:41am

Could you provide a sample app or program that reproduces this issue? What other software components are needed to run it? How should PyTorch be installed, if it is needed?

ricardocuenk · September 27, 2017, 5:29am

I am currently using Anaconda to manage Python and the corresponding packages. I use a conda environment with Python 3.5. To create an environment with the required components (using conda):

conda create -n python=3.5
source activate
conda install pytorch torchvision cuda80 -c soumith

Alternatively pytorch may be set up using pip:
pip3 install http://download.pytorch.org/whl/cu80/torch-0.2.0.post3-cp35-cp35m-manylinux1_x86_64.whl
pip3 install torchvision

I am attaching a very simple script that reproduces the errors intermittently. The script usually executes fine the first time and fails at some point when it is executed again. Interrupting the script usually causes it to fail the next time it is run.
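
Since the failure only shows up on repeated runs, a small harness can run the script several times and record how each run ended. This is a sketch under assumptions: describe_exit and run_repeatedly are hypothetical names, and the signal decoding relies on POSIX behaviour, where a negative return code means the child was killed by that signal.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a short label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by a signal.
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            return "killed by signal {}".format(-returncode)
    return "exit code {}".format(returncode)


def run_repeatedly(command, runs=5):
    """Run command several times and collect a label for each outcome."""
    results = []
    for _ in range(runs):
        completed = subprocess.run(command)
        results.append(describe_exit(completed.returncode))
    return results


# Harmless demonstration: an empty Python program always exits cleanly.
print(run_repeatedly([sys.executable, "-c", "pass"], runs=2))
```

Replacing the demonstration command with [sys.executable, "crash_test.py"] (a placeholder file name, not the name of the attached script) would show a segmentation fault as "killed by SIGSEGV", while a Python exception would appear as "exit code 1".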

However, after further testing I found out my CPU may be affected by a hardware issue present in some Ryzen processors that causes system instability under various conditions. It is likely that this is related to the problems I’ve experienced with Cuda and python.
I am currently in the process of replacing the CPU. I will update this Topic as soon as I test the system with a new CPU.
crash.zip (915 Bytes)

ricardocuenk · October 17, 2017, 3:19pm

Hi. I’ve replaced the CPU in my system, and after some testing I can say that the issues I was experiencing were indeed hardware-related. With the new CPU, CUDA has been working fine with my programs.
Thanks!

younky.yang · December 3, 2017, 1:21am

I encountered the Xid error when running KDE applications. It happens unexpectedly, and once it does, only the mouse cursor remains visible; everything else freezes.

In the log file, I can see the error:
Dec 03 08:45:41 gentoo-amd kernel: NVRM: GPU at PCI:0000:42:00: GPU-f57007b3-0a41-f62d-67dd-8648df008e8a
Dec 03 08:45:41 gentoo-amd kernel: NVRM: GPU Board Serial Number:
Dec 03 08:45:41 gentoo-amd kernel: NVRM: Xid (PCI:0000:42:00): 31, Ch 00000020, engmask 00000101, intr 10000000

From some Google searching, many people say this is a driver issue. Has NVIDIA reproduced the error, and can it provide a fix?
