Hi.
I’m having problems with CUDA and Python (PyTorch). Running a script usually ends in one of three errors: CUDA runtime error 4, CUDNN_STATUS_INTERNAL_ERROR, or a segmentation fault. However, the first script I execute after booting runs just fine. I’ve checked with nvidia-smi that the GPU memory is freed after the script terminates.
deviceQuery and bandwidthTest from the provided CUDA samples run fine and both report “PASS”.
Below is each error, followed by the kernel log lines that appeared after it:
Error 1:
RuntimeError: cuda runtime error (4) : unspecified launch failure at /home/ric/ptDocker/pytorch/torch/lib/THC/generic/THCTensorCopy.c:18
[ 1204.635697] NVRM: GPU Board Serial Number:
[ 1204.635702] NVRM: Xid (PCI:0000:23:00): 69, Class Error: ChId 0028, Class 0000c1c0, Offset 00001b00, Data 00040002, ErrorCode 0000000c
Error 2:
torch.backends.cudnn.CuDNNError: 4: b'CUDNN_STATUS_INTERNAL_ERROR'
[ 1570.852521] NVRM: Xid (PCI:0000:23:00): 31, Ch 00000028, engmask 00000101, intr 10000000
Error 3:
Segmentation fault (core dumped)
[ 2521.280234] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000029 Cl 0000c1c0 Off 000001b8 Data 04095b00
[ 2521.282631] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000028 Cl 0000c1c0 Off 000001bc Data 00000102
[ 2521.282779] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000028 Cl 0000c1c0 Off 000001c0 Data 00000020
[ 2521.284353] python[4719]: segfault at 500000004 ip 00007f7ebfd71815 sp 00007ffe0a5a6cb8 error 4 in libcuda.so.375.66[7f7ebfbe6000+6c2000]
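In case it helps anyone going through the attached dmesg.txt, here is a minimal sketch for pulling the Xid codes out of kernel log lines like the ones above. The function name `extract_xids` and the regular expression are my own for illustration (not part of any NVIDIA tool), and the pattern only assumes the `NVRM: Xid (PCI:...): <code>,` layout seen in these logs.

```python
import re
from collections import Counter

# Matches kernel log lines of the form seen above, e.g.
# [ 1570.852521] NVRM: Xid (PCI:0000:23:00): 31, Ch 00000028, ...
# Assumption: the timestamp / PCI address / code layout is as in this post.
XID_PATTERN = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),"
)


def extract_xids(lines):
    """Return a list of (timestamp, pci_address, xid_code) tuples."""
    events = []
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            timestamp = float(match.group(1))
            pci_address = match.group(2)
            xid_code = int(match.group(3))
            events.append((timestamp, pci_address, xid_code))
    return events


# Log lines copied from the three errors in this post.
sample = [
    "[ 1204.635697] NVRM: GPU Board Serial Number:",
    "[ 1204.635702] NVRM: Xid (PCI:0000:23:00): 69, Class Error: ChId 0028, "
    "Class 0000c1c0, Offset 00001b00, Data 00040002, ErrorCode 0000000c",
    "[ 1570.852521] NVRM: Xid (PCI:0000:23:00): 31, Ch 00000028, "
    "engmask 00000101, intr 10000000",
    "[ 2521.280234] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000029 Cl 0000c1c0 "
    "Off 000001b8 Data 04095b00",
    "[ 2521.282631] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000028 Cl 0000c1c0 "
    "Off 000001bc Data 00000102",
    "[ 2521.282779] NVRM: Xid (PCI:0000:23:00): 12, Ch 00000028 Cl 0000c1c0 "
    "Off 000001c0 Data 00000020",
]

events = extract_xids(sample)
counts = Counter(code for _, _, code in events)

# Print how often each Xid code occurred, lowest code first.
for code, count in sorted(counts.items()):
    print(f"Xid {code}: {count} occurrence(s)")
```

To run it over a whole log, something like `extract_xids(open("dmesg.txt"))` should work, assuming the file holds plain dmesg output.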
I haven’t experienced issues with my graphics output (other than minor freezing right after a script fails).
I’ve tried different versions of the Python package and they all fail. I’ve also tried reinstalling CUDA and changing the driver (I’ve tried the runfile installations of versions 375.66 and 384.69, but I’m currently using driver version 375.66 installed through Ubuntu’s package manager). I’ve also tried reinstalling the operating system from scratch and switching to an earlier kernel (10.0-3).
I’m attaching the content of /proc/cpuinfo and the logs after a series of three tests (which generated the above errors), as well as the bug-report.log.
Any help would be greatly appreciated.
My system specifications are:
CPU: AMD Ryzen 1600
GPU: GTX 1080 (Zotac AMP!)
Chipset: B350 (MSI B350 PC Mate)
OS: Ubuntu 16.04.03, kernel: 13.1
CUDA: 8.0.61 + cuBLAS patch (runfile installation)
NVIDIA driver: 375.66 (package manager installation)
cuDNN: 6.021
nvidia-bug-report.log.gz (133 KB)
cpu.txt (15.5 KB)
dmesg.txt (60.6 KB)