Description
I am trying to deploy my model with TensorRT's C++ API, but the inference time is basically linear in the batch size. The same test done with the Python API shows good performance: the inference time increases only slightly when the batch size is doubled, which is what I expect.
The test results are shown below.
| Batch Size | Python | C++ |
| --- | --- | --- |
| 64 | 11.7 ms | 26 ms |
| 32 | 9.4 ms | 13 ms |
| 16 | 7.5 ms | 7 ms |
The C++ code is roughly copied from the official example.
Environment
I use the official Docker image here: nvcr.io/nvidia/tensorflow:24.01-tf2-py3
TensorRT Version: 8.6.1
GPU Type: 4070 Super
Nvidia Driver Version: 560.94
CUDA Version: 12.6
CUDNN Version:
Operating System + Version: WSL 2 Ubuntu 24.04
Python Version (if applicable): 3.10
TensorFlow Version (if applicable): 2.14.0
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Container, nvcr.io/nvidia/tensorflow:24.01-tf2-py3
Relevant Files
MiniBatch.zip (7.7 KB)
Steps To Reproduce
There is a Readme.md inside for simple and fast reproduction. Thank you.
Hi @xxHn-pro ,
Can you please try the checks below and share whether the performance remains the same?
- Optimization Profiles:
- Ensure optimization profiles are set up correctly for dynamic batching in the C++ API (see the sketch after this list).
- Hardware Support:
- Confirm that the hardware, particularly the GPU, supports the optimizations being utilized.
- Experiment with Batch Sizes:
- Test various batch sizes, including multiples of 32, to identify optimal configurations.
- Cache Optimization:
- Adjust input/output caching strategies in the GPU memory for better throughput.
- Builder Configurations:
- Enable FP16 precision and disable INT8 if necessary, and adjust multi-head attention (MHA) fusions in the builder configuration.
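For reference on the optimization-profile and FP16 points above, here is a minimal sketch of how min/opt/max batch dimensions and FP16 would be declared when building an engine directly with the TensorRT C++ API. With TF-TRT the equivalent shape ranges are supplied during conversion on the Python side, so treat this as an illustration rather than a drop-in fix; the input name "input_1" and the 224x224x3 shape are placeholders.
#include "NvInfer.h"

// Sketch only: declare a dynamic-batch optimization profile on the builder config.
void ConfigureDynamicBatch(nvinfer1::IBuilder& builder,
                           nvinfer1::IBuilderConfig& config) {
  nvinfer1::IOptimizationProfile* profile = builder.createOptimizationProfile();
  // Allow batch sizes 1..64, optimized for 32 (NHWC ResNet-50-style input assumed).
  profile->setDimensions("input_1", nvinfer1::OptProfileSelector::kMIN,
                         nvinfer1::Dims4{1, 224, 224, 3});
  profile->setDimensions("input_1", nvinfer1::OptProfileSelector::kOPT,
                         nvinfer1::Dims4{32, 224, 224, 3});
  profile->setDimensions("input_1", nvinfer1::OptProfileSelector::kMAX,
                         nvinfer1::Dims4{64, 224, 224, 3});
  config.addOptimizationProfile(profile);
  config.setFlag(nvinfer1::BuilderFlag::kFP16);  // per the FP16 suggestion above
}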
Thanks for your reply.
To be clear, the TensorRT model is converted with the Python API, as suggested by the documentation. The SAME model, exactly the same model, is then loaded with the C++ API and benchmarked. Both are run on the same machine, so the Builder Configurations and Hardware Support should be fine.
Optimization Profiles
There is some strange code in the official example, listed below.
// Set the TF graph optimizer to level L0 and turn off the global XLA JIT.
tensorflow::RunOptions run_options;
tensorflow::SessionOptions sess_options;
tensorflow::OptimizerOptions* optimizer_options =
    sess_options.config.mutable_graph_options()->mutable_optimizer_options();
optimizer_options->set_opt_level(tensorflow::OptimizerOptions::L0);
optimizer_options->set_global_jit_level(tensorflow::OptimizerOptions::OFF);
It seems that the author does not want TensorFlow to optimize at all, because the loaded model has already been converted by TensorRT via the Python API. I have tried "L1" and "ON_1"; nothing changes.
I believe that dynamic batching is configured during conversion rather than at inference time. If that is wrong, please correct me and show me the right code. Are there any other options I should try?
Experiment with Batch Sizes
It is not necessary to find the optimal configuration or batch size at this stage. All I want now is for the C++ API to behave like the Python API, i.e. the inference time should scale sublinearly with the batch size.
Cache Optimization
I have checked the code from the official example. As implied by the comments, the input and output should both be located in GPU memory.
Hi @AakankshaS
A simple, basic example of inference using the TF-TRT C++ API that behaves the same as the Python API is needed. There is an old example, but it is outdated. I made some modifications to make it work, but it does not behave as expected. An updated version really should be made available.
Thank you in advance; I am in urgent need of this for my project.
Hi! @AakankshaS
I have found a difference between the two inference runs!
Profile
I ran the commands below and got different results.
nsys profile -w true -t cuda,nvtx,cudnn,cublas -f true -x true -o profile_python python Bench.py
nsys profile -w true -t cuda,nvtx,cudnn,cublas -f true -x true -o profile_c /opt/tensorflow/tensorflow-source/bazel-bin/tensorflow/examples/image_classification/MiniBatch/mini_tftrt --model_path="./resnet50_saved_model_RT" --batch_size=64 --output_to_host=False
Here is the profile result of the Python API.
And here is the result for the C++ API.
As shown above, there are three threads named like “TP tf_inter_op_parallelism”. In Python they work together, while in C++ two of them are idle most of the time.
The profile files are also provided for your convenience.
Profile.zip (6.4 MB)
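As a side note, the size of that inter-op thread pool can be set explicitly from the C++ side. A minimal sketch, assuming the session is built from SessionOptions as in the official example (the value 3 is just an illustration):
tensorflow::SessionOptions sess_options;
// Size of the "tf_inter_op_parallelism" pool visible in the nsys timeline (0 lets TF pick).
sess_options.config.set_inter_op_parallelism_threads(3);
sess_options.config.set_intra_op_parallelism_threads(0);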
Cache Double Check
In addition, the locations of the input and output are checked with the following code. Both are allocated on the GPU.
#include <cstdio>
#include <cuda_runtime.h>

#include "tensorflow/core/framework/tensor.h"

void CheckMemType(const tensorflow::Tensor& input, const std::string& name) {
  // Query where the tensor's backing buffer physically lives (host or device).
  cudaPointerAttributes attrs;
  cudaPointerGetAttributes(&attrs, input.flat<float>().data());
#if CUDART_VERSION >= 10000
  bool is_physically_gpu = (attrs.type == cudaMemoryTypeDevice);
#else
  bool is_physically_gpu = (attrs.memoryType == cudaMemoryTypeDevice);
#endif
  printf("%s located at: %s\n", name.c_str(), is_physically_gpu ? "GPU" : "CPU");
}
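For example, this is a hypothetical call, assuming inputs_device and outputs are the tensor vectors used for inference:
CheckMemType(inputs_device[0], "input");
CheckMemType(outputs[0], "output");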
Any further advice or suggestions will be appreciated!
I have some updates here.
1. cuStreamSynchronize
I found that cuStreamSynchronize takes most of the time in the C++ API. Could I skip it?
I think the real computation is already done and the GPU is just wasting time. Is that right?
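One way to check that is to time a single call with an explicit device synchronization before stopping the clock; if most of the wall time is still spent waiting, the GPU really is still computing. A minimal sketch, reusing handle, inputs_device and outputs from the example code:
#include <chrono>
#include <cuda_runtime.h>

auto t0 = std::chrono::steady_clock::now();
TFTRT_ENSURE_OK(
    bundle.session->RunCallable(handle, inputs_device, &outputs, nullptr));
cudaDeviceSynchronize();  // wait until the GPU work has actually finished
auto t1 = std::chrono::steady_clock::now();
printf("batch latency: %.2f ms\n",
       std::chrono::duration<double, std::milli>(t1 - t0).count());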
2. fetch_skip_sync
The inference is invoked by
TFTRT_ENSURE_OK(
bundle.session->RunCallable(handle, inputs_device, &outputs, nullptr));
And I cannot get rid of cuStreamSynchronize by setting fetch_skip_sync to false, which raises the error below:
tensorflow/examples/image_classification/MiniBatch/main.cc:298 CallableOptions.fetch_skip_sync = false is not yet implemented. You can set it to true instead, but MUST ensure that Device::Sync() is invoked on the Device corresponding to the fetched tensor before dereferencing the Tensor’s memory.
3. Fetch on GPU
As implied by this link, Input tensor on GPU in C++ API · Issue #5902 · tensorflow/tensorflow · GitHub, we need to use RunCallable to keep the output tensor in GPU memory.
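For reference, here is a minimal sketch of how I understand the callable has to be configured given points 2 and 3: the feed/fetch devices pin the tensors to the GPU, fetch_skip_sync must stay true, and the synchronization is then done manually before the fetched memory is dereferenced. The tensor names and device string are placeholders for whatever the SavedModel signature actually uses.
#include <cuda_runtime.h>

#include "tensorflow/core/protobuf/config.pb.h"
#include "tensorflow/core/public/session.h"

tensorflow::Status MakeGpuCallable(tensorflow::Session* session,
                                   tensorflow::Session::CallableHandle* handle) {
  tensorflow::CallableOptions opts;
  opts.add_feed("input_1:0");    // placeholder input tensor name
  opts.add_fetch("Identity:0");  // placeholder output tensor name
  const std::string gpu = "/job:localhost/replica:0/task:0/device:GPU:0";
  (*opts.mutable_feed_devices())["input_1:0"] = gpu;
  (*opts.mutable_fetch_devices())["Identity:0"] = gpu;
  // Only true is implemented; TensorFlow will not sync the stream for us.
  opts.set_fetch_skip_sync(true);
  return session->MakeCallable(opts, handle);
}

// After RunCallable() returns, synchronize once before touching the output:
//   cudaDeviceSynchronize();  // or invoke Device::Sync() on the fetch device
//   const float* out = outputs[0].flat<float>().data();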
What should I do next? It feels like I’m stuck in a deadlock.