Training quickdraw model using CudnnLSTM leads to CUDNN_STATUS_EXECUTION_FAILED

Hi,

Ubuntu 16.04, Tesla V100 on AWS p3.2xlarge, NVIDIA driver 396.54, CUDA 9.0.176_384.81, cuDNN for CUDA 9.0
TensorFlow GPU 1.9.0, Python 3.6 via pyenv

I was curious about the Google Quickdraw game and was doing some research on how they trained the model.

I followed the file at

https://github.com/tensorflow/models/blob/master/tutorials/rnn/quickdraw/train_model.py

to run the following command

python train_model.py \
  --training_data train_data \
  --eval_data eval_data \
  --model_dir /tmp/quickdraw_model/ \
  --cell_type cudnn_lstm

The training and eval data were generated using

https://github.com/tensorflow/models/blob/master/tutorials/rnn/quickdraw/create_dataset.py

and using the files here: https://console.cloud.google.com/storage/browser/quickdraw_dataset/full/simplified
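For reference, the "simplified" files are ndjson: one JSON object per line, with a "drawing" field holding a list of strokes, each stroke a pair of [x_coords, y_coords]. Below is my own rough reconstruction of the per-drawing conversion create_dataset.py performs (normalize, then delta-encode, with a pen-lift flag in the third column); it is a sketch based on the documented format, not the actual tutorial code:

```python
import json

import numpy as np

def parse_line(ndjson_line):
    """Convert one simplified-format drawing into delta-encoded ink.

    Returns an (n_points - 1, 3) float32 array of (dx, dy, end_of_stroke)
    rows plus the class label, mimicking the tutorial's input shape.
    """
    sample = json.loads(ndjson_line)
    strokes = sample["drawing"]
    total_points = sum(len(stroke[0]) for stroke in strokes)
    ink = np.zeros((total_points, 3), dtype=np.float32)
    i = 0
    for xs, ys in strokes:
        for x, y in zip(xs, ys):
            ink[i] = [x, y, 0.0]
            i += 1
        ink[i - 1, 2] = 1.0  # flag the last point of each stroke
    # Normalize coordinates into [0, 1].
    lo = ink[:, 0:2].min(axis=0)
    hi = ink[:, 0:2].max(axis=0)
    scale = np.where(hi - lo == 0.0, 1.0, hi - lo)
    ink[:, 0:2] = (ink[:, 0:2] - lo) / scale
    # Delta-encode: each row becomes the offset from the previous point.
    ink[1:, 0:2] = ink[1:, 0:2] - ink[:-1, 0:2]
    return ink[1:], sample.get("word")
```

The real script additionally shuffles classes together and serializes the result into TFRecords, but the per-drawing encoding above is the core of it.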

Then the program stops after giving the following errors:

2019-02-01 06:41:15.770071: E tensorflow/stream_executor/cuda/cuda_dnn.cc:943] CUDNN_STATUS_EXECUTION_FAILED Failed to set dropout descriptor with state memory size: 3932160 bytes.
2019-02-01 06:41:15.770123: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED Failed to set dropout descriptor with state memory size: 3932160 bytes.
Traceback (most recent call last):
File "/home/ubuntu/.pyenv/versions/abc/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/home/ubuntu/.pyenv/versions/abc/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/ubuntu/.pyenv/versions/abc/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED Failed to set dropout descriptor with state memory size: 3932160 bytes.
[[Node: cudnn_lstm/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="bidirectional", dropout=0.3, input_mode="linear_input", is_training=true, rnn_mode="lstm", seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transpose, cudnn_lstm/zeros, cudnn_lstm/zeros, cudnn_lstm/opaque_kernel/read)]]
[[Node: OptimizeLoss/clip_by_global_norm/mul_1/_239 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_354_OptimizeLoss/clip_by_global_norm/mul_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]

After doing some research, it seems the error is raised by the call to cudnnSetDropoutDescriptor:

https://github.com/tensorflow/tensorflow/blob/r1.9/tensorflow/stream_executor/cuda/cuda_dnn.cc#L932
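For context, the wrapper at that line follows a simple status-check pattern: the raw cudnnStatus_t returned by cudnnSetDropoutDescriptor is compared against success, and anything else is logged and surfaced as the UnknownError in the traceback above. A purely illustrative sketch of that pattern (the names and numeric values here are my assumptions, not the actual TensorFlow source):

```python
# Illustrative only: mimic how a non-success cuDNN status becomes the
# Python-level UnknownError. Status values assumed from cudnnStatus_t.
CUDNN_STATUS_SUCCESS = 0
CUDNN_STATUS_EXECUTION_FAILED = 8

_STATUS_NAMES = {
    CUDNN_STATUS_SUCCESS: "CUDNN_STATUS_SUCCESS",
    CUDNN_STATUS_EXECUTION_FAILED: "CUDNN_STATUS_EXECUTION_FAILED",
}

class UnknownError(RuntimeError):
    """Stand-in for tensorflow.python.framework.errors_impl.UnknownError."""

def check_cudnn_status(status, detail):
    # Success passes through silently; any other status aborts the op
    # with a message combining the status name and a detail string.
    if status != CUDNN_STATUS_SUCCESS:
        name = _STATUS_NAMES.get(status, f"CUDNN_STATUS_<{status}>")
        raise UnknownError(f"{name} {detail}")
```

So the interesting question is not the TF-side plumbing but why the underlying cuDNN call itself fails.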

After checking the API docs, it seems CUDNN_STATUS_EXECUTION_FAILED usually indicates either a library bug or a broken installation.

I verified the installation by running the MNIST test, and it passed.

Btw, I also tried running the above command without the --cell_type param, which means it runs on the CPU; that worked without any problem. I also tried the same thing with the following setup and got the same errors:

Ubuntu 18.04, Tesla V100 on AWS p3.2xlarge, NVIDIA driver 410.79, CUDA 10.0.130_410.48, cuDNN for CUDA 10.0
TensorFlow GPU 1.12.0/1.10.0, Python 3.6 via pyenv

Has anyone tried running this and encountered similar problems?

Encountered the same problem:

2019-04-16T13:45:47.761Z: [1,0]:tensorflow.python.framework.errors_impl.UnknownError[1,0]:: CUDNN_STATUS_EXECUTION_FAILED
2019-04-16T13:45:47.761Z: [1,0]:in tensorflow/stream_executor/cuda/cuda_dnn.cc(953): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
2019-04-16T13:45:47.761Z: [1,0]: [[node child/layer_4/recur_6/cell/cudnn_gru/CudnnRNN (defined at /tmp/tmpdh7OwL.py:111) ]]