Description
I am encountering an issue where, despite using the same PC, GPU, ONNX model file, and trtexec commands, the generated FP16 engine files exhibit different internal structures (layer configurations) and inference times each time I run the command.
I have repeated the build multiple times with an identical setup, but the results are inconsistent, and I cannot obtain reproducible output under the same conditions.
Problem:
- The internal structure (layers) of the FP16 engine files differs on every build, even though each engine is generated on the same PC and GPU, from the same ONNX model file, with the same trtexec command.
- Inference time also varies from run to run.
- I am looking for ways to fix or stabilize the GPU state and configure the environment to get consistent results.
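One way to confirm that the generated engines really differ (and not just their measured timing) is to compare file hashes across builds. A minimal sketch, assuming two serialized engines from separate trtexec runs were saved under hypothetical names engine_run1.trt and engine_run2.trt:

```python
import hashlib


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large engine files need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def engines_identical(path_a, path_b):
    """True if the two serialized engines are byte-for-byte identical."""
    return file_sha256(path_a) == file_sha256(path_b)
```

If the digests differ across runs, the builder chose different tactics, which is expected behavior when kernel selection is based on build-time timing measurements.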
Questions:
- Is there a way to generate consistent FP16 engines with the same environment each time?
- Are there any options in trtexec that can help fix or optimize the GPU state for consistent results?
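For reference, the kind of trtexec invocation in question looks like the following (the model and engine file names are placeholders; --onnx, --fp16, --saveEngine, and --loadEngine are standard trtexec flags):

```shell
# Build an FP16 engine from an ONNX model and serialize it to disk.
trtexec --onnx=model.onnx --fp16 --saveEngine=model_fp16.engine

# Benchmark the serialized engine without rebuilding it.
trtexec --loadEngine=model_fp16.engine
```

Loading a previously saved engine with --loadEngine at least separates build-time nondeterminism from run-to-run timing variance.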
Environment
TensorRT Version: 7.2.1.6
GPU Type: 3080Ti Laptop
Nvidia Driver Version: 560.94
CUDA Version: cuda_11.1.0_456.43_win10
CUDNN Version: cudnn-11.1-windows-x64-v8.0.5.39
Operating System + Version: Windows 11 Pro
To achieve consistent FP16 engine generation in TensorRT and stabilize the GPU state, consider the following. Keep in mind that TensorRT's builder selects kernels by timing candidate implementations on the actual GPU at build time, so fluctuations in GPU clocks, temperature, or concurrent workloads can lead to different layer/tactic choices on each run:
- Avoid Numerical Overflow with FP16 Scales: Use FP32 scales for Q/DQ operations; FP16 scales can overflow and introduce precision-related instability.
- Optimize Network Architecture: Be aware that some layers regress in performance under FP16. Adjust your network architecture or per-layer precision settings to fit your workload.
- Manage GPU Memory Usage: Reduce batch size or workspace size in the builder config to lower peak GPU memory usage; memory pressure can change which tactics are viable and hurt consistency.
- Handle NaNs and Infinite Values: Force FP32 precision on numerically sensitive layers so intermediate values do not overflow or underflow in FP16.
- Adjust Builder Optimization Levels: If a higher optimization level produces inconsistent results, try a lower one; less aggressive tuning can trade some performance for more repeatable builds.
- Monitor Performance Regressions: Stay current with TensorRT releases, since regressions noted in one version are often fixed in later updates. TensorRT 7.2 is several major versions old.
- Address Operating System Differences: The operating system can affect workload latency. Use up-to-date drivers and a matching CUDA toolkit to improve performance consistency.
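On the "stabilize the GPU state" part of the question specifically: run-to-run variance often comes from dynamic GPU clocking. A sketch of pinning clocks with nvidia-smi before building and benchmarking (requires administrator rights; the clock value below is a placeholder that must come from your GPU's own supported list):

```shell
# Persistence mode (nvidia-smi -pm 1) is Linux-only; skip it on Windows.

# List the clock frequencies this GPU supports.
nvidia-smi -q -d SUPPORTED_CLOCKS

# Lock the GPU core clock to a fixed range (placeholder values, in MHz).
nvidia-smi -lgc 1350,1350

# ... run trtexec builds and benchmarks here ...

# Restore default clock behavior afterwards.
nvidia-smi -rgc
```

With clocks pinned, build-time kernel timing measurements become more repeatable, which makes identical tactic selection across builds more likely (though not guaranteed).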
By implementing these practices, you can improve the consistency of FP16 engine generation and stabilize the GPU environment in your setup.
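To check whether such changes actually reduce the variance, it helps to measure inference-time spread over many runs rather than comparing single numbers. A minimal harness in pure Python, with a placeholder infer callable standing in for the real TensorRT execution call:

```python
import statistics
import time


def measure_latency(infer, warmup=10, runs=100):
    """Time repeated calls to `infer` and summarize the spread.

    `infer` is a zero-argument callable standing in for one
    inference invocation (e.g. a TensorRT execute + stream sync).
    """
    for _ in range(warmup):
        infer()  # discard warm-up iterations
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - start) * 1e3)  # milliseconds
    return {
        "mean_ms": statistics.mean(samples),
        "stdev_ms": statistics.stdev(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }
```

A standard deviation that stays large relative to the mean suggests the GPU state is still fluctuating between runs.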