cuDNN crashes ever since an error during training

jbrt · June 27, 2018, 9:19am

OS: Ubuntu 16.04 LTS, tested also on Windows 10
CUDA: v9.0 (installed with .deb)
CuDNN: v7.0.5 for CUDA 9.0 (installed with .deb)
NVIDIA drivers: originally 384, now 396.26
GPU: GeForce GTX 1080 Ti

Hi All,

I was training and testing my stuff for a few weeks after getting a new GPU, without any problems. Suddenly, I got an error mid-training:

E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:208] Unexpected Event status: 1

Ever since, even the mnistCUDNN sample fails randomly. Sometimes it passes, sometimes fails with one of two errors:

Loading image data/five_28x28.pgm
Performing forward propagation ...
Cuda failure
Error: an illegal memory access was encountered
mnistCUDNN.cpp:605

Testing cudnnFindConvolutionForwardAlgorithm ...
CUDNN failure
Error: CUDNN_STATUS_ALLOC_FAILED
mnistCUDNN.cpp:558

CUDA is installed properly using the package manager, and the CUDA samples pass. Other processes (Xorg, compiz, firefox…) don’t have any problems using the GPU. I also tested a game on Win10 (same machine): The Witcher 3 works fine.

Things I’ve already tried:

1. Reinstalling CUDA and cuDNN (with the package manager)
2. Reinstalling Ubuntu (and then 1.)
3. Testing on Win10 on the same machine (no cuDNN samples, but TensorFlow (1.8.0) raises the same errors as it does on Ubuntu when trying to run TF samples)
4. Updating the drivers
5. Using sudo to run the samples. The persistence daemon works properly and the problem occurs with persistence both enabled and disabled

Has anyone encountered similar problems before? Could it be a faulty GPU? Am I missing something…?
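
Since the sample sometimes passes and sometimes fails with different errors, it can help to run it many times and count how often each outcome appears. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the failing run prints a line starting with "Error: " (as in the output above) and that a passing run exits with code 0.

```python
import collections
import subprocess


def classify_run(output, returncode):
    """Return a short label for one run of the sample.

    Assumption: a failing run prints a line starting with "Error: ",
    as in the mnistCUDNN output quoted in this thread.
    """
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Error: "):
            return line[len("Error: "):]
    # No "Error:" line found: fall back to the process exit code.
    if returncode == 0:
        return "pass"
    return "unknown failure (exit %d)" % returncode


def run_many(command, runs):
    """Run `command` `runs` times and count each outcome label."""
    labels = []
    for _ in range(runs):
        proc = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        labels.append(classify_run(proc.stdout, proc.returncode))
    return collections.Counter(labels)
```

For example, `run_many(["./mnistCUDNN"], 20)` (the path is an assumption about where the sample was built) would return a counter of outcomes that can be pasted into a report alongside the raw logs.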

jbrt · July 16, 2018, 10:43am

Update: I contacted the general NVIDIA support but was advised to update this thread and wait for the team. I also edited the thread’s name, so hopefully it better explains the situation now (previously: “cuDNN sample fails since a random crash during training”).

yanxu · July 25, 2018, 9:45pm

Hello Jbrt, cuDNN team here, can you try 3 things so we have a better understanding of the issue?

1. Can you run cuda-memcheck on the mnistCUDNN sample and post the results?
2. Can you try something like memtest86 on your machine? From what we have seen, this kind of random “illegal memory access” can sometimes be caused by (host) RAM failure.
3. Can you try the sample in the latest cuDNN v7.1.4 and see if the issue still remains?

Thanks!

jbrt · July 26, 2018, 6:45pm

Hello Yanxu, thanks for responding. I’ve just noticed I didn’t mention in my original post that I’ve actually already run cuda-memcheck.

The output of cuda-memcheck is not fully reproducible - I noticed three types of errors that are thrown under exactly the same circumstances. The output is pretty long, so I’ll try to shorten it a bit:

1.1:

cudnnGetVersion() : 7104 , CUDNN_VERSION from cudnn.h : 7104 (7.1.4)
Host compiler version : GCC 5.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 28  Capabilities 6.1, SmClock 1620.0 Mhz, MemSize (Mb) 11177, MemClock 5505.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
CUDNN failure
Error: CUDNN_STATUS_INTERNAL_ERROR
mnistCUDNN.cpp:558
Aborting...
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors

1.2:

cudnnGetVersion() : 7104 , CUDNN_VERSION from cudnn.h : 7104 (7.1.4)
Host compiler version : GCC 5.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 28  Capabilities 6.1, SmClock 1620.0 Mhz, MemSize (Mb) 11177, MemClock 5505.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
Loading image data/one_28x28.pgm
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 8
=========     at 0x00000098 in void fermiPlusCgemmLDS128_batched<bool=1, bool=0, bool=0, bool=0, int=4, int=4, int=4, int=3, int=3, bool=1, bool=0>(float2* const *, float2* const *, float2* const *, float2*, float2 const *, float2 const *, int, int, int, int, int, int, __int64, __int64, __int64, float2 const *, float2 const *, float2, float2, int)
=========     by thread (7,5,0) in block (0,0,99)
=========     Address 0x7fe6cf645518 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x2486ed]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x134d952]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x134db47]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x137c8d5]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe99abc]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe99b99]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe9acfc]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe9a6cb]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe7345b]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe6abce]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcac2be]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcac948]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcb210c]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcb3921]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x780fa3]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x842c7]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x846e6]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 (cudnnConvolutionForward + 0x2cc) [0x854ec]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x89368]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x8e993]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 (cudnnFindConvolutionForwardAlgorithm + 0x248) [0x7fa78]
=========     Host Frame:mnistCUDNN [0x189bb]
=========     Host Frame:mnistCUDNN [0x10d67]
=========     Host Frame:mnistCUDNN [0xe23b]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
=========     Host Frame:mnistCUDNN [0x74d9]
=========

### similar as above, 6 times more ###

=========
========= Invalid __global__ read of size 8
=========     at 0x00000098 in void fermiPlusCgemmLDS128_batched<bool=1, bool=0, bool=0, bool=0, int=4, int=4, int=4, int=3, int=3, bool=1, bool=0>(float2* const *, float2* const *, float2* const *, float2*, float2 const *, float2 const *, int, int, int, int, int, int, __int64, __int64, __int64, float2 const *, float2 const *, float2, float2, int)
=========     by thread (1,5,0) in block (0,0,15)
=========     Address 0x7fe6cf645278 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x2486ed]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x134d952]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x134db47]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x137c8d5]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe99abc]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe99b99]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe9acfc]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe9a6cb]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe7345b]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe6abce]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcac2be]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcac948]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcb210c]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcb3921]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x780fa3]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x842c7]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x846e6]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 (cudnnConvolutionForward + 0x2cc) [0x854ec]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x89368]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x8e993]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 (cudnnFindConvolutionForwardAlgorithm + 0x248) [0x7fa78]
=========     Host Frame:mnistCUDNN [0x189bb]
=========     Host Frame:mnistCUDNN [0x10d67]
=========     Host Frame:mnistCUDNN [0xe23b]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
CUDNN failure
Error: CUDNN_STATUS_INTERNAL_ERROR
mnistCUDNN.cpp:558
Aborting...

=========     Host Frame:mnistCUDNN [0x74d9]
=========
========= ERROR SUMMARY: 8 errors

1.3: as above, but with 4 errors of the same type.
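
Long cuda-memcheck logs like the ones above can be condensed with a small script before posting. This is a sketch under the assumption that the log follows the format shown in this post (tool lines prefixed with "=========", an "ERROR SUMMARY" line at the end); `summarize_memcheck` is a hypothetical helper name, not part of any NVIDIA tool.

```python
import re
from collections import Counter

PREFIX = "========="


def summarize_memcheck(log_text):
    """Condense a cuda-memcheck log into error kinds, addresses, and total."""
    kinds = Counter()
    addresses = []
    reported_total = None
    for raw in log_text.splitlines():
        if not raw.startswith(PREFIX):
            continue  # program output, not a cuda-memcheck line
        line = raw[len(PREFIX):].strip()
        if line.startswith("Invalid "):
            # e.g. "Invalid __global__ read of size 8"
            kinds[line] += 1
        match = re.match(r"Address (0x[0-9a-fA-F]+) is out of bounds", line)
        if match:
            addresses.append(int(match.group(1), 16))
        match = re.match(r"ERROR SUMMARY: (\d+) errors?", line)
        if match:
            reported_total = int(match.group(1))
    return {
        "kinds": dict(kinds),
        "addresses": addresses,
        "reported_total": reported_total,
    }
```

Comparing the summaries of several runs makes it easier to see whether the faulting kernel and access size stay the same while only the addresses and error counts change.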

Thanks, I ran memtest86 as advised - it passed with no errors.
I’ve just tested it with cuDNN v7.1.4 + CUDA 9.0 and cuDNN v7.1.4 + CUDA 9.2. The problem persists.

If you have any ideas, please let me know. If it’s a hardware issue, I can submit a warranty claim to the distributor. It’s very important for me to solve this as soon as possible. Thanks for your help!

zzhanhuimei · March 20, 2019, 3:16am

Hello jbrt. I also ran into this problem with a 1080 Ti, CentOS 7, CUDA 9.0, and cuDNN 7.1.4. I have reinstalled the NVIDIA driver, cuDNN, and CUDA, but none of that helped. Have you solved this problem?

jbrt · April 11, 2019, 5:10pm

Hi zzhanhuimei,

Sorry for the late response. I submitted a warranty claim, got my money back, got a new GPU - it works now.
Must’ve been a hardware problem, unfortunately.

michaelschartner · July 28, 2020, 10:20am

Hi,

I also got CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered when using a GeForce GTX 1080 Ti on Ubuntu. My convolutional neural network application had been working well, but then it started hitting this error. I also have a GeForce RTX 2080 Ti Rev. A GPU on which the same application is still running (before that, I ran the same application in parallel on both GPUs for many weeks). Reading the posts here, I conclude that it’s a hardware failure and that I should send the GPU back?

Any advice appreciated, thanks!
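
With two GPUs in one machine, one way to narrow this down is to run the same command once per GPU, exposing only one device each time through the CUDA_VISIBLE_DEVICES environment variable. The sketch below assumes the application exits with a non-zero code when it hits the error; the helper names are hypothetical.

```python
import os
import subprocess


def env_for_gpu(index, base_env=None):
    """Copy the environment and expose only the GPU with this index."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(index)
    return env


def run_on_each_gpu(command, gpu_indices):
    """Run `command` once per GPU index and collect the exit codes."""
    results = {}
    for index in gpu_indices:
        proc = subprocess.run(command, env=env_for_gpu(index))
        results[index] = proc.returncode
    return results
```

A call such as `run_on_each_gpu(["python", "train.py"], [0, 1])` (where `train.py` stands in for the actual application) shows whether the failure follows one card. Note that CUDA's default device ordering may differ from the order shown by nvidia-smi, so it is worth confirming which physical card each index refers to.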
