Same memory usage for fp16 and int8

Hello,

I wanted to benchmark a depth estimation model on a Jetson Xavier NX in terms of speed and memory usage. For that purpose I converted the PyTorch model to ONNX and then created TensorRT engines with fp32, fp16, and int8 precision. In terms of speed (FPS) everything looks correct: the fp16 engine is faster than fp32, and the int8 engine is the fastest.
Memory usage is around 1.9 GB for fp32 and around 1.1 GB for both fp16 and int8. The difference between fp32 and fp16 seems reasonable, but I cannot understand why the usage is similar for the fp16 and int8 engines.
Could someone explain whether this behavior is expected?
Could you please advise how I can profile memory usage? (My application is written in Python.)
Is there any method to calculate FLOPs or TOPS for a TensorRT engine?
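At the moment I am only sampling the resident set size of my own process, which doesn't break the usage down inside TensorRT. A rough sketch of what I'm doing (Linux-only, reading /proc; psutil's Process().memory_info().rss would give the same number):

```python
def rss_kib(pid="self"):
    """Return the resident set size (VmRSS) of a process in KiB, Linux only."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                # Line looks like: "VmRSS:    123456 kB"
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found in /proc status")

print(rss_kib())  # prints the current process RSS in KiB
```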

Thanks,
Tigran

Hi,

The memory usage depends on the inference algorithms TensorRT selects.
Lower precision is not guaranteed to use less memory.

However, you can limit the maximum workspace size (in MB) when creating the TensorRT engine.
For example:

$ /usr/src/tensorrt/bin/trtexec --workspace=16 ...
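If you build the engine from Python rather than trtexec, the same limit can be set on the builder config. A minimal sketch for the TensorRT 7.x Python API (network creation and ONNX parsing are elided):

```python
import tensorrt as trt

# Sketch: cap TensorRT's scratch ("workspace") memory when building an
# engine from Python. 16 MB mirrors trtexec's --workspace=16.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
config.max_workspace_size = 16 << 20  # in bytes; 16 MB

# ... create the network, parse the ONNX model, then:
# engine = builder.build_engine(network, config)
```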

To get FLOPs information, you can use nvprof with the --metrics flag.
For example:

$ sudo /usr/local/cuda-10.2/bin/nvprof --metrics flop_count_sp /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx
==30550== Profiling result:
==30550== Metric result:
Invocations                               Metric Name                            Metric Description         Min         Max         Avg
Device "Xavier (0)"
    Kernel: void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=3, int=4, int=1, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
         80                             flop_count_sp   Floating Point Operations(Single Precision)     1350272     3075136     2212704
    Kernel: void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=4, int=1, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
         80                             flop_count_sp   Floating Point Operations(Single Precision)     2515072     5737536     4126304
    Kernel: void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=1, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
...

Thanks.


Hi,

Thank you for the explanation, @AastaLLL.
One more question, please: how can I get the number of int8 operations?
I was able to measure the number of fp32 and fp16 operations using the flop_count_sp and flop_count_hp metrics respectively, but I cannot find any metric for int8 operations.

Thanks.

Hi,

You can use the tensor_int_fu_utilization metric mentioned in the document below:
https://docs.nvidia.com/cuda/archive/11.0_GA/profiler-users-guide/index.html#metrics-reference-7x
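For example, with the same nvprof setup as above (note this metric reports tensor-core integer-unit utilization rather than an operation count; the mnist.onnx path and --int8 flag are just placeholders for your own engine):

```shell
sudo /usr/local/cuda-10.2/bin/nvprof --metrics tensor_int_fu_utilization \
    /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --int8
```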

Thanks.