Crash on training (CUDA_ERROR_LAUNCH_FAILED)

progr.mail · June 7, 2019, 9:41am

Code to reproduce the issue

# tested on devel-gpu-py3 (where the TF wheel was built) and latest-gpu-py3
docker pull tensorflow/tensorflow:latest-gpu-py3

# open a shell inside the docker image
docker run --runtime=nvidia -it -v ~/projects/gpt2-simple:/gpt2 tensorflow/tensorflow:latest-gpu-py3 bash

In Docker container:

pip uninstall tensorflow-gpu

cd /gpt2
# from https://github.com/yaroslavvb/tensorflow-community-wheels/issues/109
pip install tensorflow-1.13.1-cp35-cp35m-linux_x86_64.whl

# install gpt2-simple
pip install gpt_2_simple==0.4.2

# from https://gist.github.com/saippuakauppias/4f41ce1072a04588a2bab7dae00f9bb7
python imdb_reviews.py

Every time I get:

root@63f592f02a0e:/gpt2# python imdb_reviews.py
...
2019-05-14 07:05:22.669722: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-05-14 07:05:22.669812: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
Aborted (core dumped)
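
One way to narrow down an error like this is to rerun the script with the CUDA environment variable CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so a failure is reported at the launch that caused it instead of at a later event poll. The wrapper below is only a sketch: the helper names blocking_env and run_blocking are made up for illustration, and it does not fix anything, it only helps locate the failing step.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    # Makes CUDA kernel launches synchronous, so an error is reported at the
    # launch that caused it rather than at a later event poll.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(script, *args):
    """Run a Python script in a child process with synchronous CUDA launches."""
    cmd = [sys.executable, script] + list(args)
    return subprocess.call(cmd, env=blocking_env())
```

For example, run_blocking("imdb_reviews.py") would rerun the script from this post. Expect it to run noticeably slower than usual.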

My tf_env output looks as if something is wrong…

But!
The TensorFlow debugger MNIST example works fine and uses the GPU (I can see it in nvidia-smi):

root@4697d32838f4:/gpt2# python -m tensorflow.python.debug.examples.debug_mnist
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/debug/examples/debug_mnist.py:46: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/mnist_data/train-images-idx3-ubyte.gz
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/mnist_data/train-labels-idx1-ubyte.gz
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:110: dense_to_one_hot (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.one_hot on tensors.
Extracting /tmp/mnist_data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist_data/t10k-labels-idx1-ubyte.gz
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2019-05-14 17:07:28.376774: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2800000000 Hz
2019-05-14 17:07:28.378926: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x6115c40 executing computations on platform Host. Devices:
2019-05-14 17:07:28.378987: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-14 17:07:28.598760: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-14 17:07:28.601785: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x61c1dd0 executing computations on platform CUDA. Devices:
2019-05-14 17:07:28.601829: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-05-14 17:07:28.602453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
pciBusID: 0000:0b:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2019-05-14 17:07:28.602502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-14 17:07:28.605463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-14 17:07:28.605502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-05-14 17:07:28.605525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-05-14 17:07:28.605978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10470 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:0b:00.0, compute capability: 6.1)
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-05-14 17:07:29.756412: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
Accuracy at step 0: 0.1094
Accuracy at step 1: 0.098
Accuracy at step 2: 0.098
Accuracy at step 3: 0.098
Accuracy at step 4: 0.098
Accuracy at step 5: 0.098
Accuracy at step 6: 0.098
Accuracy at step 7: 0.098
Accuracy at step 8: 0.098
Accuracy at step 9: 0.098

System information

Have I written custom code: I used minimaxir/gpt-2-simple (a Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts) for fine-tuning
OS Platform: Ubuntu 16.04 & Docker
TensorFlow installed from (source or binary): custom TF build without AVX, from issue #109 in yaroslavvb/tensorflow-community-wheels: TF 1.13.1 GPU (CUDA 10.0, cuDNN 7.4), Ubuntu 16.04, Python 3.5 [from official docker image]
TensorFlow version (use command below): b'v1.13.1-0-g6612da8' 1.13.1
Python version: Python 3.5.2
Bazel version (if compiling from source): 0.21.0
GCC/Compiler version (if compiling from source): Same as docker tensorflow/tensorflow:latest-gpu-py3 & tensorflow/tensorflow:devel-gpu-py3 (the error reproduces in both)
CUDA/cuDNN version: 10.0 / 7.4.1.5-1 (per devel-gpu.Dockerfile on master in tensorflow/tensorflow)
GPU model and memory: GeForce GTX 1080 Ti (11 GB)

PS: TensorFlow members/support have not been able to help for a month, and have not really tried…

Could you please tell me at least what I should try?

progr.mail · June 29, 2019, 8:38pm

Any ideas?..

kus · July 30, 2019, 9:02am

Could you try a lower version of tensorflow-gpu (below 1.12) and see if you have the same problem?

progr.mail · August 7, 2019, 8:12pm

I tested on v1.12.2 and v1.11.0, and now on v1.14.0 as well.
Same results :(

I have already reinstalled docker-ce and nvidia-docker (to new versions).

Are there any other ways to solve this?

kus · August 8, 2019, 1:24am

OK. My experience is with a slightly different problem: the error did not occur every time, but it happened frequently. Our resolution was to use TF version 1.8, and the problem has not occurred since.

progr.mail · August 9, 2019, 10:28am

Hello! Thanks for the tip, but that didn’t help :(

When I tried to run the script on the 1.8 docker image, I just got an error ("… core dumped").
So I built version 1.8 for my CPU (without AVX) and launched it, and the server (Ubuntu as guest OS with VMware ESXi as host) turned off…

progr.mail · October 5, 2019, 9:27pm

The problem was the video card: it had memory errors (this can be verified with ForkLab/cuda_memtest on GitHub, a fork of CUDA GPU memtest, or with OCCT on Windows).
Solution: replace the video card.
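
For anyone wondering how a memory test finds this kind of fault: testers of this type generally write known bit patterns to memory, read them back, and report any mismatch. The sketch below shows that idea only. It runs on ordinary host memory in pure Python, not on the GPU, and it is not how cuda_memtest or OCCT is implemented; pattern_test and StuckBitBuffer are hypothetical names, and StuckBitBuffer simply simulates one faulty bit so the check has something to find.

```python
def pattern_test(buf, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Write each byte pattern over buf, read it back, and collect mismatches.

    Returns a list of (offset, expected, actual) tuples; empty means no errors.
    """
    errors = []
    for pattern in patterns:
        # Fill the whole buffer with the pattern.
        for i in range(len(buf)):
            buf[i] = pattern
        # Read every byte back and compare it with what was written.
        for i in range(len(buf)):
            actual = buf[i]
            if actual != pattern:
                errors.append((i, pattern, actual))
    return errors


class StuckBitBuffer:
    """Hypothetical stand-in for faulty memory: some bits at one offset read as 1."""

    def __init__(self, size, bad_offset, stuck_mask):
        self._data = bytearray(size)
        self._bad_offset = bad_offset
        self._stuck_mask = stuck_mask

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        value = self._data[i]
        if i == self._bad_offset:
            value |= self._stuck_mask  # the stuck bits always read back as 1
        return value

    def __setitem__(self, i, value):
        self._data[i] = value


healthy_errors = pattern_test(bytearray(1024))
faulty_errors = pattern_test(StuckBitBuffer(64, bad_offset=10, stuck_mask=0x04))
```

Using several patterns matters: a bit stuck at 1 is invisible while the pattern being written already has that bit set, so it only shows up under patterns that clear it.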