[TF1.7][TRT3.0.4] Load TRT optimized saved_model to do inference, but sess run failed at "Cuda Error in gieCudaMalloc"


What I am trying to do:

  • Starting from an already trained, frozen TF model, I restored it and successfully optimized it with the TensorRT Python API. I then exported the optimized model as a saved_model. Finally, I wrote a new script that loads the saved_model and runs inference, but a CUDA-malloc-related error popped up during sess.run.
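For reference, the optimization step described above looks roughly like this with the TF 1.7 contrib API. This is a sketch, not my exact code: `frozen_graph_def` is assumed to be the already-frozen GraphDef, and the output node name is taken from the inference script further down.

```python
# Hedged sketch of the TRT optimization step (TF 1.7 contrib API).
# Assumption: `frozen_graph_def` holds the frozen GraphDef restored earlier.
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

trt_graph_def = trt.create_inference_graph(
    input_graph_def=frozen_graph_def,
    outputs=["InceptionV3/Logits/SpatialSqueeze"],
    max_batch_size=1,                  # largest batch the engine must serve
    max_workspace_size_bytes=1 << 30,  # scratch space TensorRT may allocate
    precision_mode="FP32")             # or "FP16" / "INT8"
```

Note that `max_workspace_size_bytes` is memory TensorRT allocates on its own, outside TensorFlow's allocator, which matters for the error below.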


2018-05-03 10:08:10.710528: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-05-03 10:08:11.315302: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-05-03 10:08:11.315719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: 
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:00:08.0
totalMemory: 7.43GiB freeMemory: 7.31GiB
2018-05-03 10:08:11.315757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-05-03 10:08:11.886331: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-03 10:08:11.886496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-05-03 10:08:11.886574: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-05-03 10:08:11.886928: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7046 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:08.0, compute capability: 6.1)
2018-05-03 10:08:14.007927: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger resources.cpp (199) - Cuda Error in gieCudaMalloc: 2
terminate called after throwing an instance of 'nvinfer1::CudaError'
  what():  std::exception

My script:

import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, ['serve'], './fs2')
    graph = tf.get_default_graph()
    for op in graph.get_operations():
        print(op.name)
    inp = graph.get_tensor_by_name("import/Placeholder:0")
    oup = graph.get_tensor_by_name("import/InceptionV3/Logits/SpatialSqueeze:0")
    sess.run(oup, {inp: batch_input})

I believe the overall procedure is correct.
Any comments or pointers would be greatly appreciated.


This is a CUDA memory allocation issue: error code 2 in gieCudaMalloc is cudaErrorMemoryAllocation, i.e. out of memory. By default TensorFlow grabs nearly all GPU memory for its own allocator, leaving TensorRT nothing to allocate its engine workspace from.
I solved it by capping TensorFlow's share of GPU memory when creating the session:
sess = tf.Session(config=tf.ConfigProto(gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.50)))
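A reasonable fraction can be estimated from the card's total memory and the TensorRT workspace you configured. This is a sketch with a hypothetical helper name (`tf_memory_fraction`) and an assumed 512 MB headroom for CUDA context and engine overhead; the 7430 MB figure comes from the Tesla P4 log above.

```python
def tf_memory_fraction(total_mem_mb, trt_workspace_mb, headroom_mb=512):
    """Fraction of GPU memory TensorFlow may claim, leaving the rest
    for the TensorRT workspace plus a fixed headroom (assumed 512 MB)."""
    usable = total_mem_mb - trt_workspace_mb - headroom_mb
    # Never return a negative fraction if the budget doesn't fit
    return max(0.0, round(usable / float(total_mem_mb), 2))

# Tesla P4 from the log (~7430 MB total), 2 GiB reserved for TensorRT:
print(tf_memory_fraction(7430, 2048))  # → 0.66
```

The result is what you would pass as `per_process_gpu_memory_fraction`; 0.50 in the fix above is simply a more conservative choice of the same knob.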