Crash on training (CUDA_ERROR_LAUNCH_FAILED)

Code to reproduce the issue

# tested on devel-gpu-py3 (where the TF wheel was built) and latest-gpu-py3
docker pull tensorflow/tensorflow:latest-gpu-py3

# start a shell inside the Docker image
docker run --runtime=nvidia -it -v ~/projects/gpt2-simple:/gpt2 tensorflow/tensorflow:latest-gpu-py3 bash

In Docker container:

pip uninstall tensorflow-gpu

cd /gpt2
# from https://github.com/yaroslavvb/tensorflow-community-wheels/issues/109
pip install tensorflow-1.13.1-cp35-cp35m-linux_x86_64.whl

# install gpt2-simple
pip install gpt_2_simple==0.4.2

# from https://gist.github.com/saippuakauppias/4f41ce1072a04588a2bab7dae00f9bb7
python imdb_reviews.py

Every time I get:

root@63f592f02a0e:/gpt2# python imdb_reviews.py
...
2019-05-14 07:05:22.669722: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-05-14 07:05:22.669812: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
Aborted (core dumped)
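
For triaging a log like the one above, a small parser can separate the error (E) and fatal (F) entries from the informational lines. The following Python sketch assumes the line layout seen in this thread (`date time: severity file:line] message`); `parse_tf_log_line` and `find_errors` are hypothetical helper names, not part of TensorFlow.

```python
import re

# Assumed layout of a TensorFlow C++ log line, based on the output in this thread:
# "<date> <time>: <severity letter> <source file>:<line>] <message>"
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<severity>[IWEF]) (?P<file>[^\s:]+):(?P<line>\d+)\] (?P<message>.*)$"
)


def parse_tf_log_line(line):
    """Return a dict of fields for a matching log line, or None otherwise."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])
    return fields


def find_errors(lines):
    """Keep only the entries logged with severity E (error) or F (fatal)."""
    parsed = (parse_tf_log_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["severity"] in ("E", "F")]


if __name__ == "__main__":
    sample = [
        "2019-05-14 07:05:22.669722: E tensorflow/stream_executor/cuda/cuda_event.cc:48] "
        "Error polling for event status: failed to query event: "
        "CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure",
        "2019-05-14 07:05:22.669812: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] "
        "Unexpected Event status: 1",
        "Aborted (core dumped)",
    ]
    for entry in find_errors(sample):
        print(entry["severity"], entry["file"], entry["message"])
```

Lines that do not follow the assumed layout, such as `Aborted (core dumped)` or Python-side `WARNING:tensorflow:` messages, are simply skipped.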

My tf_env output looks as if something is wrong with it…

However, the TensorFlow debugger MNIST example works fine and uses the GPU (I can see it in nvidia-smi):

root@4697d32838f4:/gpt2# python -m tensorflow.python.debug.examples.debug_mnist
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/debug/examples/debug_mnist.py:46: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/mnist_data/train-images-idx3-ubyte.gz
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/mnist_data/train-labels-idx1-ubyte.gz
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:110: dense_to_one_hot (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.one_hot on tensors.
Extracting /tmp/mnist_data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist_data/t10k-labels-idx1-ubyte.gz
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2019-05-14 17:07:28.376774: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2800000000 Hz
2019-05-14 17:07:28.378926: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x6115c40 executing computations on platform Host. Devices:
2019-05-14 17:07:28.378987: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-14 17:07:28.598760: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-14 17:07:28.601785: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x61c1dd0 executing computations on platform CUDA. Devices:
2019-05-14 17:07:28.601829: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-05-14 17:07:28.602453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
pciBusID: 0000:0b:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2019-05-14 17:07:28.602502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-14 17:07:28.605463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-14 17:07:28.605502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-05-14 17:07:28.605525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-05-14 17:07:28.605978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10470 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:0b:00.0, compute capability: 6.1)
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-05-14 17:07:29.756412: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
Accuracy at step 0: 0.1094
Accuracy at step 1: 0.098
Accuracy at step 2: 0.098
Accuracy at step 3: 0.098
Accuracy at step 4: 0.098
Accuracy at step 5: 0.098
Accuracy at step 6: 0.098
Accuracy at step 7: 0.098
Accuracy at step 8: 0.098
Accuracy at step 9: 0.098

System information

Have I written custom code: used https://github.com/minimaxir/gpt-2-simple for finetuning
OS Platform: Ubuntu 16.04 & Docker
TensorFlow installed from (source or binary): custom TF build (without AVX) from https://github.com/yaroslavvb/tensorflow-community-wheels/issues/109
TensorFlow version (use command below): b'v1.13.1-0-g6612da8' 1.13.1
Python version: Python 3.5.2
Bazel version (if compiling from source): 0.21.0
GCC/Compiler version (if compiling from source): Same as docker tensorflow/tensorflow:latest-gpu-py3 & tensorflow/tensorflow:devel-gpu-py3 (the error reproduces in both)
CUDA/cuDNN version: 10.0 / 7.4.1.5-1 ( https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dockerfiles/dockerfiles/devel-gpu.Dockerfile )
GPU model and memory: GeForce GTX 1080 Ti (11 GB)

P.S. TensorFlow members/support have not been able to help for a month, and have not tried…

Please tell me at least what I should try.

Any ideas?

Could you try a lower version of tensorflow-gpu (below 1.12) and see if you have the same problem?

I tested on v1.12.2 and v1.11.0, and now on v1.14.0.
Same results :(

I have already reinstalled docker-ce and nvidia-docker (upgrading both to newer versions).

Are there any other ways to solve this?

OK. My experience is with a slightly different problem: the error did not occur every time, but it happened frequently. Our resolution was to use TF version 1.8, and the problem has not occurred since.

Hello! Thanks for the tip, but that didn’t help :(

When I tried to run the script on the 1.8 Docker image, I just got an error ("… core dumped").
I then built version 1.8 for my CPU (without AVX) and launched it, and the server (Ubuntu as the guest OS, VMware ESXi as the host) shut down…

The problem was the video card: it had memory errors (this can be verified with https://github.com/ForkLab/cuda_memtest or with OCCT on Windows).
Solution: replace the video card.
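
The tools named above run their checks on the GPU itself. The Python sketch below only illustrates the general write/read-back idea behind such memory tests, using a host-side buffer as a stand-in for device memory. It is not how cuda_memtest is implemented, and `pattern_test` and `StuckBitMemory` are hypothetical names introduced for the illustration.

```python
def pattern_test(memory, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill the buffer with each byte pattern, read it back, and return
    a list of (offset, expected, actual) tuples for every mismatch."""
    mismatches = []
    for pattern in patterns:
        # Write pass: store the pattern in every cell.
        for offset in range(len(memory)):
            memory[offset] = pattern
        # Read pass: a healthy cell must return exactly what was written.
        for offset in range(len(memory)):
            actual = memory[offset]
            if actual != pattern:
                mismatches.append((offset, pattern, actual))
    return mismatches


class StuckBitMemory:
    """Hypothetical faulty buffer: the bits in `stuck_mask` always read
    as 1 at `bad_offset`, imitating a defective memory cell."""

    def __init__(self, size, bad_offset, stuck_mask):
        self._data = bytearray(size)
        self._bad_offset = bad_offset
        self._stuck_mask = stuck_mask

    def __len__(self):
        return len(self._data)

    def __setitem__(self, offset, value):
        self._data[offset] = value

    def __getitem__(self, offset):
        value = self._data[offset]
        if offset == self._bad_offset:
            value |= self._stuck_mask
        return value


if __name__ == "__main__":
    print("healthy:", pattern_test(bytearray(1024)))
    print("faulty: ", pattern_test(StuckBitMemory(1024, bad_offset=100, stuck_mask=0x04)))
```

Using several complementary patterns matters: a bit stuck at 1 goes unnoticed when the written pattern already has that bit set, so it only shows up for patterns where the bit should read as 0.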