TF-TRT CUDA error on Pegasus

Hi,

I’m using Pegasus and trying to build a TF-TRT model using trt.create_inference_graph.

When trying to generate the model, these errors occurred:

2019-06-10 16:51:26.724539: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger engine.cpp (99) - Cuda Error in initializeCommonContext: 4 (Could not initialize cudnn, please check cudnn installation.)
2019-06-10 16:51:26.738716: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger engine.cpp (99) - Cuda Error in initializeCommonContext: 4 (Could not initialize cudnn, please check cudnn installation.)
2019-06-10 16:51:26.739282: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:511] Engine creation for batch size 16 failed Internal: Failed to build TensorRT engine
2019-06-10 16:51:26.739366: W tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:290] Engine retrieval for batch size 1 failed. Running native segment for TRTEngineOp_0
2019-06-10 16:51:26.846929: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-06-10 16:51:26.861626: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-06-10 16:51:26.863472: E tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:180] Failed to execute native segment TRTEngineOp_0: Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node MobilenetV2/Conv/Conv2D}}]]
Exception in thread Thread-8:
Traceback (most recent call last):
  File "/home/nvidia/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/nvidia/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/nvidia/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node MobilenetV2/Conv/Conv2D}}]]
[[{{node SemanticPredictions}}]]

I’m not sure whether the problem is that the TensorFlow version does not match the cuDNN or TensorRT version.
All of this code runs on Pegasus.

  • TensorRT 5.0.3
  • TensorFlow 1.13.0-rc0
  • cuDNN 7.3.1
  • CUDA 10.0

Could you help me with this issue?
Thanks.

This could be due to the GPU running out of memory (OOM). Could you try reducing the TF GPU memory fraction by setting config.gpu_options.per_process_gpu_memory_fraction?
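
For anyone landing here later, a minimal sketch of what that setting could look like. The helper name make_session_config and the value 0.5 are my own assumptions for illustration, not something given in this thread; the right fraction depends on your model and on what else is using the GPU.

```python
import tensorflow as tf

# The thread uses TF 1.13, where ConfigProto lives at the top level;
# newer releases keep it under tf.compat.v1, so fall back to that.
ConfigProto = getattr(tf, "ConfigProto", None) or tf.compat.v1.ConfigProto


def make_session_config(fraction=0.5):
    """Return a session config that caps TensorFlow's share of GPU memory.

    `fraction` is an illustrative value (assumption), not one from the thread.
    """
    config = ConfigProto()
    # Let TF reserve at most this fraction of each visible GPU's memory,
    # leaving the rest available to allocations made outside TF's pool.
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    return config


config = make_session_config(0.5)
# Pass the config to the session that runs the TF-TRT graph, e.g.:
#   with tf.Session(config=config) as sess:
#       ...
```

If the cuDNN errors persist, trying a smaller fraction is a reasonable next step.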

Thank you, limiting the TF GPU memory fraction fixed this for me!