GPU crashes when running machine learning models

pettes · November 30, 2019, 9:14am

I have an ASUS GTX 1080 Ti Strix OC Edition that fails after a few minutes whenever I train machine learning models in Keras (in Python, using TensorFlow 2.0.0).

The GPU is on driver version 441.41, but I don’t think the error is driver-related: it has been crashing for about a year now, and I have been through countless driver versions in that period. I am on Windows 10 64-bit.

The error message Python throws during the crash takes a few different forms. For example:

RuntimeError: Error copying tensor to device: CPU:0. GPU sync failed

Or

InternalError: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size]: [1, 64, 64, 1, 100, 24] 
	 [[{{node unified_lstm_1/CudnnRNN}}]] [Op:__inference_keras_scratch_graph_652]

Or

An error ocurred while starting the kernel
2019 09:58:29.815934: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019 09:58:29.820282: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library nvcuda.dll
2019 09:58:29.927727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1467] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 9.11GiB
2019 09:58:29.929847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1546] Adding visible gpu devices: 0
2019 09:58:30.283113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1015] Device interconnect StreamExecutor with strength 1 edge matrix:
2019 09:58:30.283500: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 0 
2019 09:58:30.283730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1034] 0: N 
2019 09:58:30.284228: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1149] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8791 MB memory) ‑> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019 09:58:52.843169: E tensorflow/stream_executor/cuda/cuda_driver.cc:1039] could not synchronize on CUDA context: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure :: 
2019 09:59:17.108062: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:753] failed to record completion event; therefore, failed to create inter‑stream dependency
2019 09:59:17.108527: I tensorflow/stream_executor/stream.cc:4800] [stream=000001A0C6B91B40,impl=000001A0C982D2E0] did not memcpy host‑to‑device; source: 000001A0B720D040
2019 09:59:17.108970: E tensorflow/stream_executor/stream.cc:331] Error recording event in stream: error recording CUDA event on stream 000001A0C5D150F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2019 09:59:17.109725: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019 09:59:17.110190: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
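For anyone comparing against their own logs: in the output above, the letter after the timestamp is the severity (I = info, E = error, F = fatal), and the first real failure is the `CUDA_ERROR_LAUNCH_FAILED` line at 09:58:52. Below is a minimal sketch, using only the Python standard library, of how such a log could be filtered down to its error and fatal lines. The helper name `summarize` and the regular expressions are my own assumptions for illustration, not part of TensorFlow, and the line format is inferred only from the excerpt above.

```python
import re

# Assumed line shape, inferred from the log excerpt above:
#   "<...> HH:MM:SS.ffffff: <I|W|E|F> <source file>:<line>] <message>"
LOG_LINE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+): (?P<level>[IWEF]) "
    r"(?P<source>\S+):(?P<line>\d+)\] (?P<message>.*)"
)
CUDA_ERROR = re.compile(r"CUDA_ERROR_[A-Z_]+")


def summarize(log_text):
    """Return (time, level, source, cuda_error_codes) for each E/F line."""
    problems = []
    for raw in log_text.splitlines():
        match = LOG_LINE.search(raw)
        if match is None or match.group("level") not in ("E", "F"):
            continue  # skip info/warning lines and anything unparseable
        codes = CUDA_ERROR.findall(match.group("message"))
        problems.append(
            (match.group("time"), match.group("level"), match.group("source"), codes)
        )
    return problems


# Three lines copied from the log above: one info, one error, one fatal.
SAMPLE = "\n".join([
    "2019 09:58:29.929847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1546] Adding visible gpu devices: 0",
    "2019 09:58:52.843169: E tensorflow/stream_executor/cuda/cuda_driver.cc:1039] could not synchronize on CUDA context: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure :: ",
    "2019 09:59:17.110190: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1",
])

for entry in summarize(SAMPLE):
    print(entry)
```

The earliest E line is usually the one worth searching for; the later errors in the log above appear to be follow-on failures from the same launch failure.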

Often, when this error occurs, the sound on my computer also stops working (the GPU’s HDMI output is connected to my receiver for sound). I don’t play many games on the PC, but when I do, the GPU has never crashed. The GPU temperature is also not very high (50-70 °C) when the errors occur.

Does anyone have any idea what could be causing this and what I can try to fix it?

Should I try software updates, a factory reset of the GPU, a GPU BIOS update, or something else? Any advice would be greatly appreciated.

pettes · November 30, 2019, 12:18pm

After switching to Python 3.7 (from 3.6), uninstalling and reinstalling TensorFlow, and updating cuDNN to the latest version, the crashing appears to have stopped! Training is, however, substantially slower than before, so something is still off, but at least it is not crashing at the moment!
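One possible explanation for the slowdown, which is only a guess and not something confirmed in this thread, is that the reinstalled TensorFlow is no longer using the GPU and has fallen back to the CPU. A rough sketch of how that could be checked is below. The helper name `gpu_report` is hypothetical; the TensorFlow calls it relies on are `tf.test.is_built_with_cuda()` and `list_physical_devices`, which sits under `tf.config.experimental` in TensorFlow 2.0 and also directly under `tf.config` in later releases.

```python
def gpu_report():
    """Collect basic facts about whether TensorFlow can see a GPU.

    Returns a dict with the TensorFlow version, whether the installed
    build has CUDA support, and the names of the visible GPU devices.
    If TensorFlow cannot be imported, the version is reported as None.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return {"tensorflow_version": None, "built_with_cuda": None, "gpus": []}

    # TensorFlow 2.0 only exposes this under tf.config.experimental;
    # newer releases also expose it directly under tf.config.
    list_devices = (
        getattr(tf.config, "list_physical_devices", None)
        or tf.config.experimental.list_physical_devices
    )
    gpus = list_devices("GPU")
    return {
        "tensorflow_version": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "gpus": [device.name for device in gpus],
    }


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `gpus` comes back empty or `built_with_cuda` is False, training is running on the CPU, which would account for the drop in speed. If the GPU is listed, the cause lies elsewhere.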
