I have an ASUS GTX 1080 Ti Strix OC Edition that crashes after a few minutes whenever I train machine learning models in Keras (in Python, using TensorFlow 2.0.0).
The GPU is using driver version 441.41, but I don't think the error is driver-related: it has been crashing for about a year now, and I have tried many driver versions in that time. I am on Windows 10 64-bit.
The error message Python throws during the crash takes a few different forms. For example:
RuntimeError: Error copying tensor to device: CPU:0. GPU sync failed
Or
InternalError: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size]: [1, 64, 64, 1, 100, 24]
[[{{node unified_lstm_1/CudnnRNN}}]] [Op:__inference_keras_scratch_graph_652]
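For reading the message above: the bracketed name lists and the numbers after them appear to pair up positionally. Below is a minimal sketch that turns the message into a dictionary, assuming that pairing holds. The helper name `parse_rnn_config` is made up for illustration and is not part of TensorFlow.

```python
import re

def parse_rnn_config(message):
    """Pair each bracketed list of names with the list of numbers after it.

    Assumption: names and values line up by position, as they appear to
    in the ThenRnnForward error text.
    """
    config = {}
    # Matches both "[a, b]: 1, 2" and "[a, b]: [1, 2]".
    pattern = re.compile(r"\[([a-z_,\s]+)\]:\s*\[?([\d,\s]+)\]?")
    for names, values in pattern.findall(message):
        keys = [name.strip() for name in names.split(",")]
        numbers = [int(value) for value in values.split(",") if value.strip()]
        config.update(zip(keys, numbers))
    return config

message = (
    "Failed to call ThenRnnForward with model config: "
    "[rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , "
    "[num_layers, input_size, num_units, dir_count, max_seq_length, batch_size]: "
    "[1, 64, 64, 1, 100, 24]"
)
print(parse_rnn_config(message))
```

Read this way, the failing call was a single-layer recurrent layer with 64 units, a sequence length of 100 and a batch size of 24. If `rnn_mode` follows cuDNN's RNN mode enum, a value of 2 would correspond to an LSTM, which fits the `unified_lstm_1` node name.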
Or
An error ocurred while starting the kernel
2019 09:58:29.815934: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019 09:58:29.820282: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library nvcuda.dll
2019 09:58:29.927727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1467] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 9.11GiB
2019 09:58:29.929847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1546] Adding visible gpu devices: 0
2019 09:58:30.283113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1015] Device interconnect StreamExecutor with strength 1 edge matrix:
2019 09:58:30.283500: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 0
2019 09:58:30.283730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1034] 0: N
2019 09:58:30.284228: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1149] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8791 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019 09:58:52.843169: E tensorflow/stream_executor/cuda/cuda_driver.cc:1039] could not synchronize on CUDA context: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure ::
2019 09:59:17.108062: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:753] failed to record completion event; therefore, failed to create inter-stream dependency
2019 09:59:17.108527: I tensorflow/stream_executor/stream.cc:4800] [stream=000001A0C6B91B40,impl=000001A0C982D2E0] did not memcpy host-to-device; source: 000001A0B720D040
2019 09:59:17.108970: E tensorflow/stream_executor/stream.cc:331] Error recording event in stream: error recording CUDA event on stream 000001A0C5D150F0: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2019 09:59:17.109725: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019 09:59:17.110190: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
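To see where things first go wrong in a log like the one above, it can help to keep only the error (E) and fatal (F) lines. Below is a minimal sketch using only the standard library. It assumes each log line has the layout `<timestamp>: <level> <source file>:<line>] <message>`, as in the excerpt. The names `LOG_LINE` and `first_failures` are made up for illustration.

```python
import re

# Assumed layout: "<timestamp>: <LEVEL> <source file>:<line>] <message>",
# where LEVEL is I (info), W (warning), E (error) or F (fatal).
LOG_LINE = re.compile(
    r"^(?P<stamp>.*?): (?P<level>[IWEF]) "
    r"(?P<source>\S+?):(?P<line>\d+)\] (?P<message>.*)$"
)

def first_failures(log_text, levels=("E", "F")):
    """Return (timestamp, source file, message) for lines at the given levels."""
    hits = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match and match.group("level") in levels:
            hits.append(
                (match.group("stamp"), match.group("source"), match.group("message"))
            )
    return hits

if __name__ == "__main__":
    import sys
    # Usage: python filter_log.py < saved_log.txt
    for stamp, source, message in first_failures(sys.stdin.read()):
        print(stamp, source, message, sep=" | ")
```

Applied to the excerpt, the earliest E line is the `CUDA_ERROR_LAUNCH_FAILED` report from `cuda_driver.cc` at 09:58:52, about 25 seconds before the remaining errors and the fatal `Unexpected Event status: 1`.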
Often when this error occurs, the sound on my computer also stops working (the GPU's HDMI output is connected to my receiver for sound). I don't play many games on this PC, but when I do, the GPU has never crashed. The GPU temperature is also not very high (50-70 °C) when the errors occur.
Does anyone have any idea what could be causing this and what I can try to fix it?
Should I try software updates, a factory reset of the GPU, a GPU BIOS update, or something else? Any advice would be greatly appreciated.