Speed up float16 conversion using Python

Hi all,

Together with my team, we have developed an inference plugin for our machine vision GStreamer/DeepStream pipeline. It is a straightforward plugin: we fetch images from the buffer and feed them into our TensorRT model running at FP16 precision.

The plugin performs preprocessing, the forward pass, and postprocessing. We've observed that the most time-consuming part of the preprocessing stage is the conversion of the input batch of images from FP32 to FP16, which is currently done in Python.

The model takes a batch of shape (3, 900, 1200, 3) as input, and these are the preprocessing steps:

    # output_batch is a pre-allocated array
    h_o, w_o = output_batch.shape[2], output_batch.shape[3]
    input_batch = np.stack(
        [downscale_image_keeping_aspect_ratio(im, (h_o, w_o)) for im in input_batch],
        axis=0,
    )

    # Rearrange the dimensions
    input_batch = input_batch.transpose(0, 3, 1, 2)  # NHWC -> NCHW
    input_batch = input_batch.astype(np.float32) * (1 / 255.0)

    # Calculate padding offsets for centering
    h, w = input_batch.shape[2], input_batch.shape[3]
    pad_height = (h_o - h) // 2
    pad_width = (w_o - w) // 2

    # Paste the input_batch into the output_batch
    bi = input_batch.shape[0]  # in case of a partial batch
    output_batch[:bi, :, pad_height : pad_height + h, pad_width : pad_width + w] = (
        input_batch.astype(output_batch.dtype)  # cast to fp16 is here!
    )

    return output_batch
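
To confirm that the cast itself dominates, a quick micro-benchmark along these lines can be run on the device (a sketch, not taken from our plugin; the batch shape matches the one above):

    import time
    import numpy as np

    # FP32 NCHW batch of the same size the plugin handles
    batch = np.random.rand(3, 3, 900, 1200).astype(np.float32)

    t0 = time.perf_counter()
    for _ in range(10):
        _ = batch.astype(np.float16)  # the cast we suspect is the bottleneck
    t1 = time.perf_counter()
    print(f"mean FP16 cast time: {(t1 - t0) / 10 * 1000:.1f} ms")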

Inference is running on a Jetson Xavier AGX along with some processing specific to our application. Here is a sample of our timings:

- INFO: Model profiling:  32.1 ms - preprocess,  51.3 ms - inference,  22.2 ms - postprocess
- INFO: Model profiling:  27.9 ms - preprocess,  50.8 ms - inference,  14.4 ms - postprocess
- INFO: Model profiling:  39.5 ms - preprocess,  50.7 ms - inference,  15.5 ms - postprocess
- INFO: Model profiling:  29.1 ms - preprocess,  52.2 ms - inference,  15.3 ms - postprocess
- INFO: Model profiling:  32.1 ms - preprocess,  51.0 ms - inference,  21.9 ms - postprocess
- INFO: Model profiling:  40.5 ms - preprocess,  51.5 ms - inference,  16.5 ms - postprocess
- INFO: Model profiling:  31.7 ms - preprocess,  50.9 ms - inference,  15.8 ms - postprocess
- INFO: Model profiling:  31.9 ms - preprocess,  50.7 ms - inference,  19.3 ms - postprocess
- INFO: Model profiling:  30.8 ms - preprocess,  50.8 ms - inference,  19.8 ms - postprocess
- INFO: Model profiling:  34.2 ms - preprocess,  51.3 ms - inference,  19.9 ms - postprocess

We would like to reach 10 fps inference at runtime, but we are currently at around 7 fps.

Are there any existing solutions we can apply to deal with the slow FP32 to FP16 conversion?

Here are the details of our setup:

- NVIDIA Jetson AGX Xavier [16GB]
   * Jetpack 4.6 [L4T 32.6.1]
   * NV Power Mode: MAXN - Type: 0
   * jetson_stats.service: active
- Libraries:
   * CUDA: 10.2.300
   * cuDNN: 8.2.1.32
   * TensorRT: 8.0.1.6
   * Visionworks: 1.6.0.501
   * OpenCV: 4.4.0 compiled CUDA: YES
   * VPI: ii libnvvpi1 1.1.15 arm64 NVIDIA Vision Programming Interface library
   * Vulkan: 1.2.70

Many thanks!

Hi,

Do you use TensorRT for inferencing?
If yes, you should be able to feed the fp32 data directly.

TensorRT will automatically add a format conversion layer that can run on GPU.
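
For example, a rough sketch with the TensorRT Python API (file names are placeholders): building the engine with the FP16 flag enabled but leaving the ONNX input as FP32 lets the engine keep an FP32 input binding and do the cast on the GPU.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    # The ONNX graph should declare FP32 inputs; TensorRT then inserts the
    # FP32 -> FP16 reformat layer itself when FP16 kernels are enabled.
    with open("model.onnx", "rb") as f:
        parser.parse(f.read())

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)   # allow FP16 kernels internally
    config.max_workspace_size = 1 << 30     # 1 GiB (TensorRT 8.0 API)

    engine = builder.build_engine(network, config)
    with open("model_fp16.engine", "wb") as f:
        f.write(engine.serialize())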

Thanks.

Hi @AastaLLL

Thanks for the reply! Yes, I am using TensorRT for inferencing.

Hopefully I understood your suggestion correctly; here are my findings:

I disabled the explicit cast to FP16 and fed TensorRT an FP32 array. There is indeed a conversion done by the engine (or the inference code), but there is also a significant increase in inference time, from ~50 ms to ~80 ms, while preprocessing decreases from ~35 ms to ~17 ms:

- INFO: Model profiling:  17.8 ms - preprocess,  80.8 ms - inference,  18.9 ms - postprocess
- INFO: Model profiling:  17.3 ms - preprocess,  85.5 ms - inference,  14.1 ms - postprocess
- INFO: Model profiling:  17.9 ms - preprocess,  84.9 ms - inference,  14.5 ms - postprocess
- INFO: Model profiling:  17.5 ms - preprocess,  82.8 ms - inference,  13.1 ms - postprocess
- INFO: Model profiling:  16.2 ms - preprocess,  82.4 ms - inference,  14.9 ms - postprocess

My TensorRT engine file was built following a PyTorch → ONNX → TensorRT approach. During the conversion, my inputs are already FP16 torch tensors. Should I do something specific to add the FP32 to FP16 conversion layer?

For reference, I am adapting code found here:


class HostDeviceMem(object):
    # Simple helper data class that's a little nicer to use than a 2-tuple.

    def __init__(self, host_mem, device_mem, shape, name):
        self.host = host_mem
        self.device = device_mem
        self.shape = shape
        self.dtype = host_mem.dtype  # here the dtype is float16
        self.name = name
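
The host buffer dtype is taken straight from the engine bindings, which would explain why it comes out as float16 here: the ONNX graph was exported with FP16 inputs, so the engine's input binding is FP16 and the cast ends up on the CPU. A sketch of the allocation step from the sample code (the engine path is a placeholder):

    import tensorrt as trt
    import pycuda.driver as cuda
    import pycuda.autoinit  # noqa: F401 -- creates the CUDA context

    logger = trt.Logger(trt.Logger.INFO)
    with open("model.engine", "rb") as f, trt.Runtime(logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())

    for i in range(engine.num_bindings):
        dtype = trt.nptype(engine.get_binding_dtype(i))  # np.float16 for an FP16 input
        shape = engine.get_binding_shape(i)
        host_mem = cuda.pagelocked_empty(trt.volume(shape), dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        print(engine.get_binding_name(i), engine.binding_is_input(i), dtype, tuple(shape))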

BR

Hi,

Just want to double-check.
Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Another alternative is to move the resize and transform preprocessing to CUDA.
You can find some examples in the jetson_utils:
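
Something along these lines, as a rough sketch (the exact functions available depend on the jetson-utils version installed; the image path and sizes are placeholders). Both the resize and the conversion to float run in CUDA instead of NumPy:

    import jetson.utils

    # load an image into shared CPU/GPU memory as uint8 RGB
    img = jetson.utils.loadImage('frame.jpg', format='rgb8')

    # GPU resize to the network resolution
    resized = jetson.utils.cudaAllocMapped(width=1200, height=900, format='rgb8')
    jetson.utils.cudaResize(img, resized)

    # GPU conversion from uint8 RGB to float RGB
    as_float = jetson.utils.cudaAllocMapped(width=1200, height=900, format='rgb32f')
    jetson.utils.cudaConvertColor(resized, as_float)

    jetson.utils.cudaDeviceSynchronize()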

Thanks.

Hi!

I am already using nvpmodel -m 0, but the FP16 conversion in NumPy was still slow.

I have a custom yolov5 model and, since it is written in PyTorch, I ended up using register_forward_pre_hook to add the following operations during the export procedure (PyTorch → ONNX → TensorRT).

    x = x.half() if module.fp16 else x.float()
    x = x / 255.0

Once converted to TensorRT, I was able to feed regular 8-bit integer images to the engine and delegate the FP16 conversion to the GPU along with some other stages.
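
For anyone else reading, here is roughly how the hook can be wired up before the export (a sketch with a toy model; the model and the fp16 attribute are placeholders, not the actual yolov5 code):

    import torch
    import torch.nn as nn

    # toy stand-in for the detection model
    model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1)).cuda().half()
    model.fp16 = True  # flag checked by the hook

    def cast_and_scale(module, inputs):
        (x,) = inputs                                # positional args of forward()
        x = x.half() if module.fp16 else x.float()   # cast becomes part of the graph
        return (x / 255.0,)                          # 0-255 -> 0-1 scaling

    model.register_forward_pre_hook(cast_and_scale)

    # the hook stays in place during export, so the cast and the scaling are
    # baked into the ONNX graph and later run on the GPU inside TensorRT
    dummy = torch.zeros(1, 3, 64, 64, dtype=torch.uint8, device="cuda")
    torch.onnx.export(model, dummy, "toy_with_cast.onnx", opset_version=13)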

Thanks for the hints! Hope this helps others running into similar issues.

Cheers

Hi,

Thanks for the feedback as well!
