I came back from GTC gung ho on using TensorRT to optimize our TensorFlow-based object detection network for better inference performance. Unfortunately, despite tweaking the parameters to the create_inference_graph method, I was only able to get a small performance boost in one case, and in most cases performance actually got worse. I am looking for some insight into why I am not getting a better boost from TensorRT. Some facts (a sketch of my conversion call follows this list):
Object detection network similar to YOLO. (But not quite the same … since our detection task is not looking for cars, cats, etc.)
Hosted in the tensorflow/serving:1.13.1-gpu container (CUDA 10, cuDNN 7.4.1, TensorRT 5.0.2), runtime = nvidia-docker 2.0, driver 410.79
Hosted on AWS p3.2xlarge instance type with Tesla V100 GPU
Set is_dynamic_op to True, since we don’t have fixed-size inputs and outputs to the various layers
TensorFlow SavedModel as input
We typically have a batch size of only one
We definitely use some unsupported operations but TensorRT could still create anywhere from 10 to 40 TRT engines during conversion.
Tried FP32, FP16, and INT8 precision modes
Tried various values of maximum_cached_engines
Tried various values of minimum_segment_size (3, 5, and 10, for example)
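For concreteness, here is a minimal sketch of the kind of create_inference_graph call I am making, using the SavedModel entry point of the TF 1.13 contrib API. The paths are placeholders, not our real model, and the parameter values shown are just one point in the ranges I have been sweeping:

import tensorflow.contrib.tensorrt as trt

# Placeholder paths; the real model is our YOLO-like detector exported as a SavedModel.
converted_graph_def = trt.create_inference_graph(
    input_graph_def=None,                      # not used when converting a SavedModel
    outputs=None,
    input_saved_model_dir="saved_model_dir",   # placeholder
    output_saved_model_dir="trt_saved_model_dir",
    max_batch_size=1,                          # we almost always run batch size 1
    max_workspace_size_bytes=1 << 30,
    precision_mode="FP16",                     # also tried "FP32" and "INT8"
    minimum_segment_size=3,                    # also tried 5 and 10
    is_dynamic_op=True,                        # inputs/outputs are not fixed size
    maximum_cached_engines=1)                  # also tried larger values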
Watching the output of TF Serving, when I do inference I can see that it takes a while to start those TRT engines, which is why I suspect that is_dynamic_op=True is one of the problems. Comments on this?
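To quantify that startup cost, something like the rough sketch below would show the gap between the first request (where the engines get built) and later requests. The model name, REST port, and input shape are placeholders, not our real setup:

import json
import time

import numpy as np
import requests

# Placeholders: adjust the model name, REST port, and input shape to the real
# serving setup (assumes tensorflow_model_server was started with --rest_api_port).
URL = "http://localhost:8501/v1/models/my_detector:predict"
payload = json.dumps({"instances": np.random.rand(1, 416, 416, 3).tolist()})

for i in range(5):
    start = time.time()
    resp = requests.post(URL, data=payload)
    resp.raise_for_status()
    print("request %d: %.3f s" % (i, time.time() - start))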
Is it due to low batch size? Would we expect to see better performance with larger batch sizes?
Also, on one older blog, but nowhere else, I found a mention of limiting TensorFlow itself to a percentage of the GPU memory.
I did not see this reference to limiting TF GPU memory in any other documentation. Should I be limiting TensorFlow to a percentage of GPU memory, and if so, what fraction would you recommend?
(It does look like TF Serving is consuming most of the GPU memory, so I will try limiting it.)
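I assume the blog was referring to the standard TF 1.x GPU options; a minimal sketch of what that looks like in plain TensorFlow would be the snippet below (for TF Serving, the equivalent is the --per_process_gpu_memory_fraction flag I mention further down). The 0.4 value is just an example, not a recommendation:

import tensorflow as tf

# Cap TensorFlow at ~40% of GPU memory, leaving headroom for the TensorRT engines.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    # ... load and run the converted graph here ...
    pass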
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P0    39W / 300W |  15718MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     22817      C   tensorflow_model_server                    15708MiB |
+-----------------------------------------------------------------------------+
It seems TRT uses the same memory space when it runs in TF Serving. I didn’t see anything else running on the GPU when I first started my model, and as soon as the TRT engines were created the memory usage jumped dramatically. I did find an option to have TF Serving use less memory: tensorflow_model_server --per_process_gpu_memory_fraction=0.400000, but it didn’t impact performance.
Lastly, I saw no impact when setting maximum_cached_engines, and this surprised me. Perhaps I am misinterpreting what this parameter does … can you explain?