TRT5.0: Memory error when building engine

Hi all,

I wanted to give TensorRT a quick try and ran into the following errors when building the engine from a UFF graph:

  • [TensorRT] ERROR: Tensor: Conv_0/Conv2D at max batch size of 80 exceeds the maximum element count of 2147483647.
    To solve this problem I had to reduce the builder max_batch_size parameter to 50 or so. Note that this is much less than the maximum batch size I am able to run using TensorFlow (around 200 before encountering an OutOfMemory error). Why is that so?
    (The convolution the error refers to is a 3x3x1x64 convolution on patches of size 100x100.)

  • [TensorRT] ERROR: runtime.cpp (24) - Cuda Error in allocate: 2
    I have had this error several times and have absolutely no clue what was causing it. One way of getting around it was to reduce the max_workspace_size parameter of the builder to, say, a third of the total GPU memory (5 GB on a P100 with 16 GB).

All in all I am not sure that I fully grasped what is behind these max_batch_size and max_workspace_size parameters. Any hints would be greatly appreciated.

Thanks

Edit: using TRT 5.0.0.10 with Cuda 9.0 and CUDNN 7.3

Hello,

Cuda Error in allocate: 2 usually indicates that the API call failed because it was unable to allocate enough memory to perform the requested operation.

When you referenced running tensorflow, was it on the same GPU?

Yes, the exact same GPU.
This error was when building the engine, exactly when calling

engine = builder.build_cuda_engine(network)

The returned value is None and the error mentioned above is logged.
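
For readers unfamiliar with the two parameters, here is a minimal sketch of how they are typically set before this call. It assumes `builder` is a TensorRT 5 builder object exposing the `max_batch_size` and `max_workspace_size` attributes named above. The helper names `configure_builder` and `build_engine_or_raise` are hypothetical, not part of TensorRT.

```python
GIB = 1 << 30  # bytes in one GiB


def configure_builder(builder, max_batch_size, workspace_gb):
    """Set the two builder limits discussed in this thread (hypothetical helper)."""
    builder.max_batch_size = max_batch_size
    # max_workspace_size is given in bytes, so convert from GiB.
    builder.max_workspace_size = int(workspace_gb * GIB)
    return builder


def build_engine_or_raise(builder, network):
    """Build the engine and fail loudly instead of silently returning None."""
    engine = builder.build_cuda_engine(network)
    if engine is None:
        raise RuntimeError(
            "build_cuda_engine returned None (max_batch_size=%d, "
            "max_workspace_size=%d bytes); check the TensorRT log"
            % (builder.max_batch_size, builder.max_workspace_size))
    return engine
```

With the values that worked earlier in the thread this would be called as `configure_builder(builder, 50, 5)` followed by `build_engine_or_raise(builder, network)`.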

Please find enclosed a small .zip file with a minimal set of dependencies to debug. Everything is pretty much explained in the main Python script so I will be brief here. The zip contains:

  • A .pb file (tensorflow model export)

  • Converted uff graph

  • Tensor RT serialized engine

  • Python script which can be used to: run TensorFlow on the reference inputs, build the TRT engine, run the TRT engine and compare with the TF results
    (The zip also contains reference inputs and outputs as NumPy arrays, but these are of no use here)

  • First of all I have noticed that if I run TensorFlow and then build the TRT engine IN THE SAME PYTHON PROCESS by launching the script with the “all” option, then I systematically get “[TensorRT] ERROR: runtime.cpp (24) - Cuda Error in allocate: 2”

  • Playing around with the max_batch_size, patch_size and max_workspace_size_gb parameters in the main python file also results in the errors described above (exceed max element count of xxx and Cuda Error in allocate)

Example

max_batch_size = 200
[TensorRT] ERROR: Tensor: Conv_0/Conv2D at max batch size of 200 exceeds the maximum element count of 2147483647

Example (running on a p100 with 16Gb memory)

max_workspace_size_gb = 8
[TensorRT] ERROR: runtime.cpp (24) - Cuda Error in allocate: 2
[TensorRT] ERROR: runtime.cpp (24) - Cuda Error in allocate: 2
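
One way to narrow this down is to check how much device memory is actually free just before the build. The sketch below shells out to `nvidia-smi` and compares the free memory with the requested workspace. The helper names are hypothetical, and a workspace that fits in free memory is only a necessary condition, since TensorRT needs memory beyond the workspace as well.

```python
import subprocess


def parse_memory_line(line):
    """Parse one 'free, total' CSV line (values in MiB) from nvidia-smi."""
    free_mib, total_mib = (int(field.strip()) for field in line.split(","))
    return free_mib, total_mib


def query_gpu_memory(gpu_index=0):
    """Return (free_mib, total_mib) for one GPU; requires nvidia-smi on PATH."""
    out = subprocess.check_output([
        "nvidia-smi", "--id=%d" % gpu_index,
        "--query-gpu=memory.free,memory.total",
        "--format=csv,noheader,nounits"])
    return parse_memory_line(out.decode().strip().splitlines()[0])


def workspace_fits(free_mib, workspace_gb):
    """True if the requested workspace (in GiB) is no larger than free memory."""
    return workspace_gb * 1024 <= free_mib
```

Calling `query_gpu_memory()` right before the build, in the same process that already ran TensorFlow, shows whether another allocator is holding most of the card.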

Thanks for your help
test_nvidia.zip (2.49 MB)

Thanks for the repro. I don’t see an all option. (Do I just call run_reference_tensorflow(), build_trt_engine(), run_trt_engine() inline?)

Also, I’m getting the following error:

root@4639a43cf129:/home/scratch.zhenyih_sw/reproduce.2421196/test_nvidia# python test_trt_for_nvidia.py -o build_trt
[TensorRT] INFO: Creating uff graph
Traceback (most recent call last):
  File "test_trt_for_nvidia.py", line 263, in <module>
    runner[args.o]()
  File "test_trt_for_nvidia.py", line 138, in build_trt_engine
    subprocess.check_call(['convert-to-uff', '-o', uff_file, tf_pb_file])
  File "/usr/lib/python2.7/subprocess.py", line 536, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib/python2.7/subprocess.py", line 523, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1343, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

root@4639a43cf129:/home/scratch.zhenyih_sw/reproduce.2421196/test_nvidia# convert-to-uff

I have tf_pb_file = ‘best_deploy.pb’ and uff_file = ‘best_deploy.uff’, but don’t have convert-to-uff. I’m running from a TRT5 container. Are you running directly from metal/host?

Hello, I’ve repro’d it now on DGX P100 16GB GPUs. Triaging now, and will keep you updated.

OK, I don’t know if this is still relevant, but to answer your previous questions:

  1. Yes indeed I messed up with the .zip, sorry about that. As you figured out, the ‘all’ option --which you don’t have-- is simply:
run_reference_tensorflow()
build_trt_engine()
run_trt_engine()
  2. The convert-to-uff binary comes with the Python UFF package provided with TensorRT. I installed TensorRT from .tar file and followed the procedure here: https://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html#installing-tar

I use this to convert my .pb graph to .uff as indicated here: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#samplecode3

  3. Also you are right I am using a virtual machine with GPU on Google Cloud Platform. Never had any problems with allocation errors / resources sharing in the past though.

Thanks for having a look!

Hello,

The issue is that TensorFlow will reserve almost all the available GPU memory by default. One possible solution is to configure the session with memory limits:

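import tensorflow as tf
# Cap TensorFlow at half of the GPU memory; `graph` is the tf.Graph your script already loads.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options), graph=graph)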

The ideal solution would be to release all the GPU memory after TensorFlow executes, but AFAIK this is not yet possible (see the Stack Overflow question “Clearing Tensorflow GPU memory after model execution” and the tensorflow/tensorflow GitHub issue #17048, “Tensorflow or cuda not giving back gpu memory after session closes”).
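
Since the memory is only given back when the process exits, another workaround is to run each step in its own Python process. The sketch below assumes the script accepts a step name through `-o`, as in the `python test_trt_for_nvidia.py -o build_trt` invocation shown earlier. The helper names `step_command` and `run_step` are hypothetical.

```python
import subprocess
import sys


def step_command(script, option):
    """Build the command line that runs one step of the script."""
    return [sys.executable, script, "-o", option]


def run_step(script, option):
    """Run one step in a fresh interpreter; GPU memory is freed when it exits."""
    # check_call raises CalledProcessError if the step exits with a non-zero code.
    subprocess.check_call(step_command(script, option))
```

For example, `run_step("test_trt_for_nvidia.py", "build_trt")` builds the engine in a process that never touched TensorFlow. The option name for the TensorFlow reference step is not shown in this thread, so substitute whatever the script defines.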

Thanks very much for the feedback!!

I understand this addresses the “Cuda error in allocate” part. Do you by any chance have any more insight into the other error?

max_batch_size = 200
[TensorRT] ERROR: Tensor: Conv_0/Conv2D at max batch size of 200 exceeds the maximum element count of 2147483647
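
As a sanity check on this second error, the arithmetic can be sketched as follows. It assumes (this is not confirmed anywhere in the thread) that the check is simply batch size times per-sample tensor volume compared against 2147483647, and it uses the 64-channel, 100x100 shape stated at the start of the thread. All function names are hypothetical.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the limit quoted in the error message


def element_count(batch, channels, height, width):
    """Total number of elements in a batch x C x H x W tensor."""
    return batch * channels * height * width


def max_batch_for(channels, height, width, limit=INT32_MAX):
    """Largest batch size whose element count stays within the limit."""
    return limit // (channels * height * width)


def per_sample_bounds(failing_batch, passing_batch, limit=INT32_MAX):
    """Per-sample volume range consistent with one failing and one passing batch."""
    return limit // failing_batch + 1, limit // passing_batch


print(element_count(80, 64, 100, 100))  # 51200000
print(max_batch_for(64, 100, 100))      # 3355
print(per_sample_bounds(80, 50))        # (26843546, 42949672)
```

Under that assumption, a 64x100x100 output would stay within the limit up to a batch size of 3355, whereas failing at 80 and passing at roughly 50 would imply about 27 to 43 million elements per sample. It may therefore be worth checking which input dimensions TensorRT actually sees for Conv_0/Conv2D, though the limit may also be computed differently from what is assumed here.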