Speed up float16 conversion using Python

Hi all,

Together with my team, we have developed an inference plugin for our machine vision GStreamer/DeepStream pipeline. It is a straightforward plugin: we fetch images from the buffer and feed them into our TensorRT model running at FP16 precision.

The plugin performs preprocessing, the forward pass, and postprocessing. We've observed that the most time-consuming part of the preprocessing stage is the conversion of the input batch of images from FP32 to FP16, which is currently done in Python.

The model takes a batch of shape (3, 900, 1200, 3) as input, and these are the preprocessing steps:

    # output_batch is a pre-allocated array
    h_o, w_o = output_batch.shape[2], output_batch.shape[3]
    input_batch = np.stack(
        [downscale_image_keeping_aspect_ratio(im, (h_o, w_o)) for im in input_batch],
        axis=0,
    )

    # Rearrange the dimensions
    input_batch = input_batch.transpose(0, 3, 1, 2)  # NHWC -> NCHW
    input_batch = input_batch.astype(np.float32) * (1 / 255.0)

    # Calculate padding offsets for centering
    h, w = input_batch.shape[2], input_batch.shape[3]
    pad_height = (h_o - h) // 2
    pad_width = (w_o - w) // 2

    # Paste the input_batch into the output_batch
    bi = input_batch.shape[0]  # in case of a partial batch
    output_batch[:bi, :, pad_height : pad_height + h, pad_width : pad_width + w] = (
        input_batch.astype(output_batch.dtype)  # cast to fp16 is here!
    )

    return output_batch
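
To confirm that the cast itself dominates, a quick micro-benchmark along these lines can be run on the device (a sketch, not taken from our plugin; the batch shape matches the one above):

    import time
    import numpy as np

    # FP32 NCHW batch of the same size the plugin handles
    batch = np.random.rand(3, 3, 900, 1200).astype(np.float32)

    t0 = time.perf_counter()
    for _ in range(10):
        _ = batch.astype(np.float16)  # the cast we suspect is the bottleneck
    t1 = time.perf_counter()
    print(f"mean FP16 cast time: {(t1 - t0) / 10 * 1000:.1f} ms")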

Inference is running on a Jetson Xavier AGX along with some processing specific to our application. Here is a sample of our timings:

- INFO: Model profiling:  32.1 ms - preprocess,  51.3 ms - inference,  22.2 ms - postprocess
- INFO: Model profiling:  27.9 ms - preprocess,  50.8 ms - inference,  14.4 ms - postprocess
- INFO: Model profiling:  39.5 ms - preprocess,  50.7 ms - inference,  15.5 ms - postprocess
- INFO: Model profiling:  29.1 ms - preprocess,  52.2 ms - inference,  15.3 ms - postprocess
- INFO: Model profiling:  32.1 ms - preprocess,  51.0 ms - inference,  21.9 ms - postprocess
- INFO: Model profiling:  40.5 ms - preprocess,  51.5 ms - inference,  16.5 ms - postprocess
- INFO: Model profiling:  31.7 ms - preprocess,  50.9 ms - inference,  15.8 ms - postprocess
- INFO: Model profiling:  31.9 ms - preprocess,  50.7 ms - inference,  19.3 ms - postprocess
- INFO: Model profiling:  30.8 ms - preprocess,  50.8 ms - inference,  19.8 ms - postprocess
- INFO: Model profiling:  34.2 ms - preprocess,  51.3 ms - inference,  19.9 ms - postprocess

We would like to reach 10 fps inference at runtime, but we are currently at around 7 fps.

Are there any existing solutions we can apply to deal with the slow FP32 to FP16 conversion?

Here are the details of our setup:

- NVIDIA Jetson AGX Xavier [16GB]
   * Jetpack 4.6 [L4T 32.6.1]
   * NV Power Mode: MAXN - Type: 0
   * jetson_stats.service: active
- Libraries:
   * CUDA: 10.2.300
   * cuDNN: 8.2.1.32
   * TensorRT: 8.0.1.6
   * Visionworks: 1.6.0.501
   * OpenCV: 4.4.0 compiled CUDA: YES
   * VPI: ii libnvvpi1 1.1.15 arm64 NVIDIA Vision Programming Interface library
   * Vulkan: 1.2.70

Many thanks!

Hi,

Do you use TensorRT for inferencing?
If yes, you should be able to feed the fp32 data directly.

TensorRT will automatically add a format conversion layer that can run on GPU.
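
For example, a rough sketch with the TensorRT Python API (file names are placeholders): building the engine with the FP16 flag enabled but leaving the ONNX input as FP32 lets the engine keep an FP32 input binding and do the cast on the GPU.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    # The ONNX graph should declare FP32 inputs; TensorRT then inserts the
    # FP32 -> FP16 reformat layer itself when FP16 kernels are enabled.
    with open("model.onnx", "rb") as f:
        parser.parse(f.read())

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)   # allow FP16 kernels internally
    config.max_workspace_size = 1 << 30     # 1 GiB (TensorRT 8.0 API)

    engine = builder.build_engine(network, config)
    with open("model_fp16.engine", "wb") as f:
        f.write(engine.serialize())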

Thanks.

Hi @AastaLLL

Thanks for the reply! Yes, I am using TensorRT for inferencing.

Hopefully I understood your suggestion correctly; here are my findings:

I disabled the explicit cast to FP16 and fed TensorRT an FP32 array. There is indeed a conversion done by the engine (or the inference code), but there is also a significant increase in inference time, from ~50 ms to ~80 ms, while preprocessing decreases from ~35 ms to ~17 ms:

- INFO: Model profiling:  17.8 ms - preprocess,  80.8 ms - inference,  18.9 ms - postprocess
- INFO: Model profiling:  17.3 ms - preprocess,  85.5 ms - inference,  14.1 ms - postprocess
- INFO: Model profiling:  17.9 ms - preprocess,  84.9 ms - inference,  14.5 ms - postprocess
- INFO: Model profiling:  17.5 ms - preprocess,  82.8 ms - inference,  13.1 ms - postprocess
- INFO: Model profiling:  16.2 ms - preprocess,  82.4 ms - inference,  14.9 ms - postprocess

My TensorRT engine file was built following a PyTorch → ONNX → TensorRT approach. During the conversion, my inputs are already FP16 torch tensors. Should I do something specific to add the FP32 to FP16 conversion layer?

For reference, I am adapting code found here:


class HostDeviceMem(object):
    # Simple helper data class that's a little nicer to use than a 2-tuple.

    def __init__(self, host_mem, device_mem, shape, name):
        self.host = host_mem
        self.device = device_mem
        self.shape = shape
        self.dtype = host_mem.dtype  # here the dtype is float16
        self.name = name
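
The host buffer dtype is taken straight from the engine bindings, which would explain why it comes out as float16 here: the ONNX graph was exported with FP16 inputs, so the engine's input binding is FP16 and the cast ends up on the CPU. A sketch of the allocation step from the sample code (the engine path is a placeholder):

    import tensorrt as trt
    import pycuda.driver as cuda
    import pycuda.autoinit  # noqa: F401 -- creates the CUDA context

    logger = trt.Logger(trt.Logger.INFO)
    with open("model.engine", "rb") as f, trt.Runtime(logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())

    for i in range(engine.num_bindings):
        dtype = trt.nptype(engine.get_binding_dtype(i))  # np.float16 for an FP16 input
        shape = engine.get_binding_shape(i)
        host_mem = cuda.pagelocked_empty(trt.volume(shape), dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        print(engine.get_binding_name(i), engine.binding_is_input(i), dtype, tuple(shape))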

BR

Hi,

Just want to double-check.
Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Another alternative is to move the resize and transform preprocessing to CUDA.
You can find some examples in the jetson_utils:
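
Something along these lines, as a rough sketch (the exact functions available depend on the jetson-utils version installed; the image path and sizes are placeholders). Both the resize and the conversion to float run in CUDA instead of NumPy:

    import jetson.utils

    # load an image into shared CPU/GPU memory as uint8 RGB
    img = jetson.utils.loadImage('frame.jpg', format='rgb8')

    # GPU resize to the network resolution
    resized = jetson.utils.cudaAllocMapped(width=1200, height=900, format='rgb8')
    jetson.utils.cudaResize(img, resized)

    # GPU conversion from uint8 RGB to float RGB
    as_float = jetson.utils.cudaAllocMapped(width=1200, height=900, format='rgb32f')
    jetson.utils.cudaConvertColor(resized, as_float)

    jetson.utils.cudaDeviceSynchronize()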

Thanks.

Hi!

I am already using nvpmodel -m 0, but the FP16 conversion in NumPy was still slow.

I have a custom yolov5 model and, since it is written in PyTorch, I ended up using register_forward_pre_hook to add the following operations during the export procedure (PyTorch → ONNX → TensorRT).

    x = x.half() if module.fp16 else x.float()
    x = x / 255.0

Once converted to TensorRT, I was able to feed regular 8-bit integer images to the engine and delegate the FP16 conversion to the GPU along with some other stages.
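
For anyone else reading, here is roughly how the hook can be wired up before the export (a sketch with a toy model; the model and the fp16 attribute are placeholders, not the actual yolov5 code):

    import torch
    import torch.nn as nn

    # toy stand-in for the detection model
    model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1)).cuda().half()
    model.fp16 = True  # flag checked by the hook

    def cast_and_scale(module, inputs):
        (x,) = inputs                                # positional args of forward()
        x = x.half() if module.fp16 else x.float()   # cast becomes part of the graph
        return (x / 255.0,)                          # 0-255 -> 0-1 scaling

    model.register_forward_pre_hook(cast_and_scale)

    # the hook stays in place during export, so the cast and the scaling are
    # baked into the ONNX graph and later run on the GPU inside TensorRT
    dummy = torch.zeros(1, 3, 64, 64, dtype=torch.uint8, device="cuda")
    torch.onnx.export(model, dummy, "toy_with_cast.onnx", opset_version=13)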

Thanks for the hints! Hope this helps others running into similar issues.

Cheers

Hi,

Thanks for the feedback as well!
