I wanted to benchmark a depth estimation model on a Jetson Xavier NX in terms of speed and memory usage. For that purpose I converted the PyTorch model to ONNX format and then created TensorRT engines with fp32, fp16, and int8 precision. In terms of speed (FPS) everything looks correct: the fp16 engine is faster than fp32, and the int8 engine is the fastest.
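For reference, the engines were built roughly along these lines (a minimal sketch assuming the TensorRT 8.x Python bindings; `model.onnx` and the engine filename are placeholders, and the int8 calibrator is omitted):

```python
import tensorrt as trt

# Minimal sketch of the ONNX -> TensorRT conversion (TensorRT 8.x assumed).
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # drop for fp32; use trt.BuilderFlag.INT8
                                       # plus a calibrator for the int8 engine

serialized = builder.build_serialized_network(network, config)
with open("model_fp16.engine", "wb") as f:
    f.write(serialized)
```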
Memory usage is around 1.9 GB for the fp32 engine and around 1.1 GB for both the fp16 and int8 engines. The difference between fp32 and fp16 seems reasonable, but I cannot understand why the fp16 and int8 engines use roughly the same amount.
Could someone explain whether this behavior is expected?
Could you please advise how I can profile memory usage? (My application is written in Python.)
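So far the only crude measurement I can do from Python is sampling the process RSS with `psutil` (a sketch; `psutil`, the sampling interval, and `run_inference` are my own choices, and I am assuming RSS is meaningful here because Xavier NX uses unified CPU/GPU memory; `tegrastats` gives a device-wide view instead):

```python
import os
import threading
import time

import psutil


def sample_rss(stop_event, interval_s=0.1):
    """Poll this process's resident set size while inference runs."""
    proc = psutil.Process(os.getpid())
    peak = 0
    while not stop_event.is_set():
        peak = max(peak, proc.memory_info().rss)
        time.sleep(interval_s)
    print(f"peak RSS: {peak / 2**20:.1f} MiB")


stop = threading.Event()
t = threading.Thread(target=sample_rss, args=(stop,))
t.start()
run_inference()  # hypothetical: whatever executes the TensorRT engine
stop.set()
t.join()
```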
Is there any method to calculate FLOPs or TOPS for a TensorRT engine?
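The only workaround I know of is counting MACs on the original PyTorch model with the third-party `thop` package, before TensorRT's layer fusions (a sketch; `load_my_model` is a hypothetical loader and the input shape 1x3x224x224 is a placeholder for the model's real input):

```python
import torch
from thop import profile  # pip install thop (pytorch-opcounter)

model = load_my_model()               # hypothetical loader for the PyTorch model
model.eval()
dummy = torch.randn(1, 3, 224, 224)   # placeholder input shape

with torch.no_grad():
    macs, params = profile(model, inputs=(dummy,))

# One MAC = one multiply + one add, so FLOPs ~= 2 * MACs.
print(f"{macs / 1e9:.2f} GMACs (~{2 * macs / 1e9:.2f} GFLOPs), "
      f"{params / 1e6:.2f} M params")
```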