Hi,
Could you please share the ONNX model and the script, if you have not already, so that we can assist you better?
Meanwhile, you can try a few things:
1) Validate your model with the snippet below.
check_model.py
import onnx
filename = "yourONNXmodel"  # replace with the path to your ONNX model
model = onnx.load(filename)
onnx.checker.check_model(model)
2) Try running your model with the trtexec command. https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
If you are still facing the issue, please share the trtexec "--verbose" log for further debugging.
Thanks!
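As a rough sketch of step 2, the trtexec call can be wrapped in a small Python helper that saves the verbose log to a file for sharing. The helper names `build_trtexec_cmd` and `run_trtexec` and the log file name are made up for illustration; only the `--onnx`, `--fp16` and `--verbose` flags are taken from the trtexec commands shown in this thread, and trtexec is assumed to be on the PATH.

```python
import subprocess


def build_trtexec_cmd(onnx_path, fp16=False, verbose=True):
    """Return the trtexec argument list for an ONNX model (hypothetical helper)."""
    cmd = ["trtexec", "--onnx={}".format(onnx_path)]
    if fp16:
        cmd.append("--fp16")     # ask trtexec to build an FP16 engine
    if verbose:
        cmd.append("--verbose")  # detailed log, useful for debugging
    return cmd


def run_trtexec(cmd, log_path):
    """Run trtexec, write stdout and stderr to log_path, return the exit code."""
    with open(log_path, "w") as log_file:
        result = subprocess.run(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    return result.returncode
```

For example, `run_trtexec(build_trtexec_cmd("yourONNXmodel"), "trtexec_verbose.log")` would leave the full log in `trtexec_verbose.log`, which can then be attached to a reply.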
But I don't have Caffe, I just have ONNX. I have converted it to TRT, but when I use the following code to check whether it is INT8: `import tensorrt as trt

TRT_LOGGER = trt.Logger()


def get_engine1(engine_path):
    # If a serialized engine exists, use it instead of building an engine.
    print("Reading engine from file {}".format(engine_path))
    with open(engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())


if __name__ == "__main__":
    engine_file_path = "F-Let.trt"
    # Netron can be used to inspect the number and shapes of the ONNX outputs
    engines = get_engine1(engine_file_path)
    for binding in engines:
        size = trt.volume(engines.get_binding_shape(binding)) * 1
        dims = engines.get_binding_shape(binding)
        print("size=", size)
        print("dims=", dims)
        print("binding=", binding)
        print("input =", engines.binding_is_input(binding))
        dtype = trt.nptype(engines.get_binding_dtype(binding))
        print("dtype =", dtype)`
I got this:
The inputs and outputs are still FP32 even for an INT8 model unless reformatFree is used, so the code above cannot be used to check for INT8.
Also, there is no INT8 support on Jetson Nano, since it is a Maxwell chip.
@spolisetty Now I use the following code to create an FP16 model: onnx_to_tensorrt.py (2.8 KB)
and this is the inference code: inference-10.py (5.5 KB)
But I just found that FP16 is slower than FP32. Can you help me?
We are unable to reproduce this issue; in our test FP16 is faster than FP32. Please share the trtexec --verbose logs of both runs for better debugging. FP16 should never be slower than FP32: in the worst case (i.e. all FP16 kernels are slower than the FP32 kernels), TRT would just fall back to FP32.
For your reference, the FP32 mean host latency is 0.160975 ms and the FP16 mean host latency is 0.136759 ms, which shows that FP16 is faster.
fp32
[05/26/2021-17:30:49] [I] Host Latency
[05/26/2021-17:30:49] [I] min: 0.152832 ms (end to end 0.163513 ms)
[05/26/2021-17:30:49] [I] max: 4.63269 ms (end to end 4.64197 ms)
[05/26/2021-17:30:49] [I] mean: 0.160975 ms (end to end 0.169979 ms)
[05/26/2021-17:30:49] [I] median: 0.158691 ms (end to end 0.16748 ms)
[05/26/2021-17:30:49] [I] percentile: 0.174606 ms at 99% (end to end 0.184784 ms at 99%)
[05/26/2021-17:30:49] [I] throughput: 5181.99 qps
[05/26/2021-17:30:49] [I] walltime: 3.00039 s
[05/26/2021-17:30:49] [I] Enqueue Time
[05/26/2021-17:30:49] [I] min: 0.143646 ms
[05/26/2021-17:30:49] [I] max: 4.62085 ms
[05/26/2021-17:30:49] [I] median: 0.147705 ms
[05/26/2021-17:30:49] [I] GPU Compute
[05/26/2021-17:30:49] [I] min: 0.139648 ms
[05/26/2021-17:30:49] [I] max: 4.61816 ms
[05/26/2021-17:30:49] [I] mean: 0.147407 ms
[05/26/2021-17:30:49] [I] median: 0.145508 ms
[05/26/2021-17:30:49] [I] percentile: 0.160156 ms at 99%
[05/26/2021-17:30:49] [I] total compute time: 2.29189 s
&&&& PASSED TensorRT.trtexec # trtexec --onnx=F_TorrentNet.onnx --verbose
fp16
[05/26/2021-17:28:57] [I] Host Latency
[05/26/2021-17:28:57] [I] min: 0.130615 ms (end to end 0.143433 ms)
[05/26/2021-17:28:57] [I] max: 3.4176 ms (end to end 3.42505 ms)
[05/26/2021-17:28:57] [I] mean: 0.136759 ms (end to end 0.151333 ms)
[05/26/2021-17:28:57] [I] median: 0.136841 ms (end to end 0.150391 ms)
[05/26/2021-17:28:57] [I] percentile: 0.149658 ms at 99% (end to end 0.162842 ms at 99%)
[05/26/2021-17:28:57] [I] throughput: 6425.03 qps
[05/26/2021-17:28:57] [I] walltime: 3.0003 s
[05/26/2021-17:28:57] [I] Enqueue Time
[05/26/2021-17:28:57] [I] min: 0.107452 ms
[05/26/2021-17:28:57] [I] max: 3.40491 ms
[05/26/2021-17:28:57] [I] median: 0.111328 ms
[05/26/2021-17:28:57] [I] GPU Compute
[05/26/2021-17:28:57] [I] min: 0.116943 ms
[05/26/2021-17:28:57] [I] max: 3.40259 ms
[05/26/2021-17:28:57] [I] mean: 0.122423 ms
[05/26/2021-17:28:57] [I] median: 0.123291 ms
[05/26/2021-17:28:57] [I] percentile: 0.132812 ms at 99%
[05/26/2021-17:28:57] [I] total compute time: 2.35996 s
&&&& PASSED TensorRT.trtexec # trtexec --onnx=F_TorrentNet.onnx --fp16 --verbose
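The comparison above can also be scripted. The sketch below pulls the `mean` values out of trtexec output that looks like the two logs in this post and prints the FP32/FP16 ratio. It assumes the log lines keep exactly the format shown here; the function name `mean_times_ms` is made up, and the embedded log excerpts are shortened copies of the lines above.

```python
import re

# Section headers and "mean" lines as they appear in the logs above.
SECTION_RE = re.compile(r"\[I\] (Host Latency|Enqueue Time|GPU Compute)\s*$")
MEAN_RE = re.compile(r"\[I\] mean: ([0-9.]+) ms")


def mean_times_ms(log_text):
    """Map each timing section name to its reported mean in milliseconds."""
    means = {}
    section = None
    for line in log_text.splitlines():
        header = SECTION_RE.search(line)
        if header:
            section = header.group(1)
            continue
        mean = MEAN_RE.search(line)
        if mean and section is not None:
            means[section] = float(mean.group(1))
    return means


FP32_LOG = """\
[05/26/2021-17:30:49] [I] Host Latency
[05/26/2021-17:30:49] [I] mean: 0.160975 ms (end to end 0.169979 ms)
[05/26/2021-17:30:49] [I] GPU Compute
[05/26/2021-17:30:49] [I] mean: 0.147407 ms
"""

FP16_LOG = """\
[05/26/2021-17:28:57] [I] Host Latency
[05/26/2021-17:28:57] [I] mean: 0.136759 ms (end to end 0.151333 ms)
[05/26/2021-17:28:57] [I] GPU Compute
[05/26/2021-17:28:57] [I] mean: 0.122423 ms
"""

fp32 = mean_times_ms(FP32_LOG)
fp16 = mean_times_ms(FP16_LOG)
for name in fp32:
    ratio = fp32[name] / fp16[name]
    print("{}: FP32 {} ms, FP16 {} ms, FP32/FP16 = {:.2f}".format(
        name, fp32[name], fp16[name], ratio))
```

A ratio above 1 means FP16 is faster; with the numbers above it comes out at roughly 1.18 for host latency and 1.20 for GPU compute.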
But when I use the .trt engines to do inference, the FP16 .trt engine is slower than FP32 for some models.
@spolisetty Hi, the inference-10.py above is my inference code. You could download F_Let.pth, F_Let_fp16.trt and inference-10.py to test. What's more, I found that FP16 inference is faster on LeNet but slower on the MLP. F_MLP.pth (940.4 KB) F_MLP_fp16.trt (484.2 KB)
We tried running convert_to_onnx.py but ran into some errors. Please share only the ONNX model, so that we can generate the FP16 and FP32 engines ourselves and verify the performance to reproduce the issue.
For your information, the conversion has to be executed on the machine on which inference will run. This is because TensorRT optimizes the graph for the available GPU, so the generated engine cannot be used on a different platform.
Sorry for the delayed response. We tried converting the model to TRT and running the inference script to test the performance, but ran into many errors. It also looks like you are timing the loading of the PyTorch model and the TRT engine together with inference. We recommend writing separate code that times only inference on the TRT engine to get a correct measurement.
When we run with trtexec, we could not reproduce the issue (FP16 slower than FP32).
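To illustrate the suggestion about timing only inference, here is a minimal sketch of a timing harness. It is not taken from inference-10.py: `time_inference`, `summarize` and `dummy_infer` are made-up names, and `infer` stands for whatever callable runs one inference on an already loaded engine. It is assumed that `infer` waits for the GPU to finish before returning (for example by synchronizing the CUDA stream), otherwise only the enqueue time would be measured.

```python
import statistics
import time


def time_inference(infer, warmup=10, iterations=100):
    """Return per-call latencies in ms for infer(), excluding any setup."""
    # Warm-up calls are not timed, so one-time initialization costs
    # do not end up in the measurement.
    for _ in range(warmup):
        infer()
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        infer()  # assumed to block until the result is ready
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Return the mean and median latency in milliseconds."""
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
    }


def dummy_infer():
    """Stand-in workload so the sketch runs without a GPU or an engine."""
    sum(i * i for i in range(1000))


print(summarize(time_inference(dummy_infer)))
```

Loading the PyTorch model and deserializing the engine would happen once, before `time_inference` is called, and the same harness would be run separately for the FP32 and FP16 engines so the two results can be compared.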