GPU crashes when running machine learning models

I have an Asus GTX 1080 Ti Strix OC Edition that crashes after a few minutes whenever I train machine learning models in Keras (in Python, using TensorFlow 2.0.0).

The GPU is currently on driver version 441.41, but I don’t think the driver is the cause: the crashes have been happening for about a year, across many different driver versions. I am using Windows 10 64-bit.

The error message Python throws during the crash takes a few different forms. For example:

RuntimeError: Error copying tensor to device: CPU:0. GPU sync failed


InternalError: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size]: [1, 64, 64, 1, 100, 24] 
	 [[{{node unified_lstm_1/CudnnRNN}}]] [Op:__inference_keras_scratch_graph_652]


An error ocurred while starting the kernel
2019 09:58:29.815934: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019 09:58:29.820282: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library nvcuda.dll
2019 09:58:29.927727: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 9.11GiB
2019 09:58:29.929847: I tensorflow/core/common_runtime/gpu/] Adding visible gpu devices: 0
2019 09:58:30.283113: I tensorflow/core/common_runtime/gpu/] Device interconnect StreamExecutor with strength 1 edge matrix:
2019 09:58:30.283500: I tensorflow/core/common_runtime/gpu/] 0 
2019 09:58:30.283730: I tensorflow/core/common_runtime/gpu/] 0: N 
2019 09:58:30.284228: I tensorflow/core/common_runtime/gpu/] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8791 MB memory) ‑> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019 09:58:52.843169: E tensorflow/stream_executor/cuda/] could not synchronize on CUDA context: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure :: 
2019 09:59:17.108062: E tensorflow/stream_executor/cuda/] failed to record completion event; therefore, failed to create inter‑stream dependency
2019 09:59:17.108527: I tensorflow/stream_executor/] [stream=000001A0C6B91B40,impl=000001A0C982D2E0] did not memcpy host‑to‑device; source: 000001A0B720D040
2019 09:59:17.108970: E tensorflow/stream_executor/] Error recording event in stream: error recording CUDA event on stream 000001A0C5D150F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2019 09:59:17.109725: E tensorflow/stream_executor/cuda/] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019 09:59:17.110190: F tensorflow/core/common_runtime/gpu/] Unexpected Event status: 1
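
A minimal sketch for summarizing a log like the one above, assuming the console output has been saved to a text file. It counts lines by severity letter (I, W, E, F), tallies any `CUDA_ERROR_*` names, and keeps the first error line, which is usually the most informative one. The function name `summarize_log` and the file name in the usage note are hypothetical; only the Python standard library is used.

```python
import re
from collections import Counter

# Matches CUDA driver error names such as CUDA_ERROR_LAUNCH_FAILED.
CUDA_ERROR = re.compile(r"CUDA_ERROR_[A-Z_]+")
# Matches the severity letter (I, W, E or F) that follows the timestamp.
SEVERITY = re.compile(r"^\S+ \S+ ([IWEF]) tensorflow/")


def summarize_log(text):
    """Count severities and CUDA error names in TensorFlow console output."""
    severities = Counter()
    cuda_errors = Counter()
    first_error = None
    for line in text.splitlines():
        match = SEVERITY.match(line)
        if match is None:
            continue  # not a TensorFlow log line (e.g. a wrapped continuation)
        level = match.group(1)
        severities[level] += 1
        cuda_errors.update(CUDA_ERROR.findall(line))
        # Remember the first error or fatal line; later ones are often follow-ups.
        if first_error is None and level in ("E", "F"):
            first_error = line
    return {
        "severities": dict(severities),
        "cuda_errors": dict(cuda_errors),
        "first_error": first_error,
    }
```

Usage would look like `summarize_log(open("crash_log.txt").read())`, where `crash_log.txt` is a placeholder for wherever the output was saved. In the log above, the first error line is the 09:58:52 "could not synchronize on CUDA context" entry; everything after it appears to follow from that failure.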

Often, when the crash occurs, the sound on my computer also stops working (the GPU's HDMI output is connected to my receiver for audio). I don’t play many games on this PC, but the GPU has never crashed while gaming. The GPU temperature is also not particularly high (50-70 °C) when the errors occur.
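
To back up a temperature observation like this with numbers, the GPU can be sampled while training runs. The following is a minimal sketch, assuming `nvidia-smi` (installed with the NVIDIA driver) is on the PATH; on some Windows installs it may need to be called by its full path instead. The function names `parse_sample` and `read_gpu_sample` are hypothetical, and the query fields are limited to temperature, utilization, and memory used.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY = "temperature.gpu,utilization.gpu,memory.used"


def parse_sample(line):
    """Parse one CSV line such as '65, 98, 8791' into a dict of integers."""
    temp, util, mem = (int(field.strip()) for field in line.split(","))
    return {"temp_c": temp, "util_pct": util, "mem_mib": mem}


def read_gpu_sample():
    """Query nvidia-smi once and return the readings for the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,  # requires Python 3.7 or newer
        text=True,
        check=True,
    )
    # nvidia-smi prints one line per GPU; only the first is used here.
    return parse_sample(result.stdout.splitlines()[0])
```

Calling `read_gpu_sample()` every second or so from a separate process, and appending the result to a file, would leave a record of the last readings before a crash. If a field is reported as not available, `parse_sample` will raise a `ValueError` rather than guess.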

Does anyone have any idea what could be causing this, and what I can try to fix it?

Should I try software updates, a factory reset of the GPU, a GPU BIOS update, or something else? Any advice would be greatly appreciated.

After switching to Python 3.7 (from 3.6), uninstalling and reinstalling TensorFlow, and updating cuDNN to the latest version, the crashing appears to have stopped! However, training is now substantially slower than before, so something is still off, but at least it is not crashing at the moment.
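
One possible explanation for the slowdown, offered only as something to rule out, is that the reinstalled TensorFlow is running on the CPU. For TensorFlow 2.0, the plain `tensorflow` pip package is CPU-only and GPU support comes from `tensorflow-gpu`. A minimal sketch of a check, assuming TensorFlow 2.x is installed in the active environment; the function name `describe_gpu_setup` is hypothetical:

```python
def describe_gpu_setup():
    """Report the TensorFlow version, CUDA build flag, and visible GPUs."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return {"tensorflow": None, "built_with_cuda": None, "gpus": []}
    gpus = tf.config.experimental.list_physical_devices("GPU")
    return {
        "tensorflow": tf.__version__,
        # False means this build cannot use an NVIDIA GPU at all.
        "built_with_cuda": tf.test.is_built_with_cuda(),
        # An empty list means no GPU is visible to TensorFlow.
        "gpus": [gpu.name for gpu in gpus],
    }


if __name__ == "__main__":
    print(describe_gpu_setup())
```

If `built_with_cuda` is `False` or the `gpus` list is empty, training is running on the CPU, which would account for the slower speed. If the GPU is listed, the cause lies elsewhere and this check does not identify it.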