Hi all,
Together with my team, I have developed an inference plugin for our machine-vision GStreamer/DeepStream pipeline. It is a straightforward plugin: we fetch images from the buffer and feed them into our TensorRT model, which runs at FP16 precision.
The plugin performs preprocessing, the forward pass, and postprocessing. We have observed that the most time-consuming part of the preprocessing stage is the conversion of the input batch of images from FP32 to FP16, which is currently done in Python.
The model takes a batch of (3, 900, 1200, 3) images as input, and these are the preprocessing steps:
# output_batch is a pre-allocated FP16 array of shape (N, C, H, W)
h_o, w_o = output_batch.shape[2], output_batch.shape[3]

# Downscale each frame while preserving its aspect ratio, then stack into one batch
input_batch = np.stack(
    [downscale_image_keeping_aspect_ratio(im, (h_o, w_o)) for im in input_batch],
    axis=0,
)

# Rearrange the dimensions
input_batch = input_batch.transpose(0, 3, 1, 2)  # NHWC -> NCHW

# Normalize to [0, 1] in FP32
input_batch = input_batch.astype(np.float32) * (1 / 255.0)

# Calculate padding offsets for centering the downscaled frames
h, w = input_batch.shape[2], input_batch.shape[3]
pad_height = (h_o - h) // 2
pad_width = (w_o - w) // 2

# Paste the input_batch into the output_batch
bi = input_batch.shape[0]  # in case of a partial batch
output_batch[:bi, :, pad_height : pad_height + h, pad_width : pad_width + w] = (
    input_batch.astype(output_batch.dtype)  # the cast to FP16 happens here!
)
return output_batch
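For illustration, this is roughly the single-pass CPU-side variant we have in mind: scale and cast into the pre-allocated FP16 output in one call, skipping the intermediate FP32 batch and the extra astype() copy. It is only a sketch (paste_into_fp16_batch is an illustrative name, and we assume the downscaled frames come out of the resize step as uint8 or FP32):

import numpy as np

def paste_into_fp16_batch(resized_nhwc, output_batch):
    # resized_nhwc: stacked, already-downscaled frames (uint8 or FP32), shape (N, h, w, 3)
    # output_batch: pre-allocated FP16 array of shape (N_max, 3, H, W)
    nchw = resized_nhwc.transpose(0, 3, 1, 2)  # NHWC -> NCHW (a view, no copy yet)
    n, _, h, w = nchw.shape
    h_o, w_o = output_batch.shape[2], output_batch.shape[3]
    pad_h, pad_w = (h_o - h) // 2, (w_o - w) // 2
    view = output_batch[:n, :, pad_h:pad_h + h, pad_w:pad_w + w]
    # Scale and cast to FP16 directly into the output view in a single pass
    np.multiply(nchw, np.float16(1.0 / 255.0), out=view)
    return output_batch

Since the multiply writes into the strided view directly there is no second full-size temporary, but it is still a CPU pass over the whole batch, so we are unsure how much it would actually buy us on the Xavier.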
Inference runs on a Jetson AGX Xavier alongside some processing specific to our application. Here is a sample of our timings:
- INFO: Model profiling: 32.1 ms - preprocess, 51.3 ms - inference, 22.2 ms - postprocess
- INFO: Model profiling: 27.9 ms - preprocess, 50.8 ms - inference, 14.4 ms - postprocess
- INFO: Model profiling: 39.5 ms - preprocess, 50.7 ms - inference, 15.5 ms - postprocess
- INFO: Model profiling: 29.1 ms - preprocess, 52.2 ms - inference, 15.3 ms - postprocess
- INFO: Model profiling: 32.1 ms - preprocess, 51.0 ms - inference, 21.9 ms - postprocess
- INFO: Model profiling: 40.5 ms - preprocess, 51.5 ms - inference, 16.5 ms - postprocess
- INFO: Model profiling: 31.7 ms - preprocess, 50.9 ms - inference, 15.8 ms - postprocess
- INFO: Model profiling: 31.9 ms - preprocess, 50.7 ms - inference, 19.3 ms - postprocess
- INFO: Model profiling: 30.8 ms - preprocess, 50.8 ms - inference, 19.8 ms - postprocess
- INFO: Model profiling: 34.2 ms - preprocess, 51.3 ms - inference, 19.9 ms - postprocess
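Averaging the samples above gives roughly 33 ms preprocess + 51 ms inference + 18 ms postprocess, i.e. about 102 ms per iteration for the model stages alone.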
We would like to reach 10 fps at runtime, but we are currently at around 7 fps.
Are there any existing solutions we could apply to speed up this slow FP32-to-FP16 conversion?
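To make the question more concrete, this is the kind of GPU-side approach we are wondering about: keep the downscaled frames as uint8 on the host (assuming we can avoid the FP32 representation up to that point), copy only the uint8 data to the device, and do the normalization and FP16 cast on the GPU so the result lands directly in an FP16 device buffer. A rough sketch with CuPy, assuming a CuPy build for CUDA 10.2 is usable on the Xavier (normalize_to_fp16_on_gpu and the buffer handling are illustrative only):

import cupy as cp

def normalize_to_fp16_on_gpu(resized_nhwc_uint8, output_gpu):
    # resized_nhwc_uint8: host array of already-downscaled frames, shape (N, h, w, 3), uint8
    # output_gpu: pre-allocated CuPy FP16 array of shape (N_max, 3, H, W)
    batch = cp.asarray(resized_nhwc_uint8)  # host-to-device copy of the compact uint8 data
    batch = batch.transpose(0, 3, 1, 2)     # NHWC -> NCHW
    n, _, h, w = batch.shape
    h_o, w_o = output_gpu.shape[2], output_gpu.shape[3]
    pad_h, pad_w = (h_o - h) // 2, (w_o - w) // 2
    # Normalize and cast on the GPU; the FP16 result is written into the padded region
    output_gpu[:n, :, pad_h:pad_h + h, pad_w:pad_w + w] = (
        batch.astype(cp.float16) * cp.float16(1.0 / 255.0)
    )
    return output_gpu

Would something along these lines be a sensible route on JetPack 4.6, or is there a more standard way to handle this inside the DeepStream/TensorRT preprocessing itself?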
Here are the details of our setup:
- NVIDIA Jetson AGX Xavier [16GB]
* Jetpack 4.6 [L4T 32.6.1]
* NV Power Mode: MAXN - Type: 0
* jetson_stats.service: active
- Libraries:
* CUDA: 10.2.300
* cuDNN: 8.2.1.32
* TensorRT: 8.0.1.6
* Visionworks: 1.6.0.501
* OpenCV: 4.4.0 compiled CUDA: YES
* VPI: ii libnvvpi1 1.1.15 arm64 NVIDIA Vision Programming Interface library
* Vulkan: 1.2.70
Many thanks!