Run PeopleNet with TensorRT

Hey everyone. After digging through a lot of blogs and repos, I managed to run a .engine or .trt file for Detectnet_v2 with preprocessing and proper postprocessing.

Here is my working code; I hope it helps future readers:

import os
import time

import cv2
import matplotlib.pyplot as plt
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
from PIL import Image

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def load_engine(trt_runtime, engine_path):
    with open(engine_path, "rb") as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

def allocate_buffers(engine, batch_size=1):
    """Allocates host and device buffers for TRT engine inference.

    This function is similar to the standard buffer-allocation helper
    from the TensorRT samples, but converts network outputs (which are
    np.float32) appropriately before writing them to the Python buffer.
    This is needed since TensorRT plugins don't support output type
    description, and in our particular case, we use the NMS plugin as
    network output.

    Args:
        engine (trt.ICudaEngine): TensorRT engine
        batch_size (int): number of images per inference call

    Returns:
        inputs [HostDeviceMem]: engine input memory
        outputs [HostDeviceMem]: engine output memory
        bindings [int]: buffer to device bindings
        stream (cuda.Stream): cuda stream for engine inference synchronization
    """
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()

    # The current NMS implementation in TRT only supports DataType.FLOAT, but
    # that may change in the future, which could break this sample
    # when using lower precision (e.g. the NMS output would no longer be
    # np.float32, even though this is assumed in binding_to_type).
    binding_to_type = {
        "input_1": np.float32,
        "output_bbox/BiasAdd": np.float32,
        "output_cov/Sigmoid": np.float32,
    }

    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * batch_size
        dtype = binding_to_type[str(binding)]
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(
        batch_size=batch_size, bindings=bindings, stream_handle=stream.handle
    )
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

def process_image(arr, w, h):
    image = Image.fromarray(np.uint8(arr))

    image_resized = image.resize(size=(w, h), resample=Image.BILINEAR)
    img_np = np.array(image_resized)
    # HWC -> CHW
    img_np = img_np.transpose((2, 0, 1))
    # Normalize to [0.0, 1.0] interval (expected by model)
    img_np = (1.0 / 255.0) * img_np
    img_np = img_np.ravel()
    return img_np

def predict(image, model_w, model_h):
    """Runs the model on a single image resized to fit the model input.

    Args:
        image (np.ndarray): RGB image in HWC layout
        model_w (int): model input width
        model_h (int): model input height
    """
    img = process_image(image, model_w, model_h)
    # Copy it into the appropriate place in memory
    # (inputs was returned earlier by allocate_buffers())
    np.copyto(inputs[0].host, img.ravel())

    # When inferring on a single image, we measure the inference
    # time to report it to the user
    inference_start_time = time.time()

    # Fetch output from the model
    # (postprocess() expects the bbox tensor first and the coverage tensor second)
    [detection_out, keepCount_out] = do_inference(
        context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream
    )

    # Output inference time
    print(
        "TensorRT inference time: {} ms".format(
            int(round((time.time() - inference_start_time) * 1000))
        )
    )
    # And return results
    return detection_out, keepCount_out

# -------------- MODEL PARAMETERS FOR DETECTNET_V2 --------------------------------
model_h = 544
model_w = 960
stride = 16
box_norm = 35.0

grid_h = int(model_h / stride)
grid_w = int(model_w / stride)
grid_size = grid_h * grid_w

grid_centers_w = []
grid_centers_h = []

for i in range(grid_h):
    value = (i * stride + 0.5) / box_norm
    grid_centers_h.append(value)

for i in range(grid_w):
    value = (i * stride + 0.5) / box_norm
    grid_centers_w.append(value)

def applyBoxNorm(o1, o2, o3, o4, x, y):
    """
    Applies the GridNet box normalization

    Args:
        o1 (float): first argument of the result
        o2 (float): second argument of the result
        o3 (float): third argument of the result
        o4 (float): fourth argument of the result
        x: column index on the grid (indexes grid_centers_w)
        y: row index on the grid (indexes grid_centers_h)

    Returns:
        float: rescaled first argument
        float: rescaled second argument
        float: rescaled third argument
        float: rescaled fourth argument
    """
    o1 = (o1 - grid_centers_w[x]) * -box_norm
    o2 = (o2 - grid_centers_h[y]) * -box_norm
    o3 = (o3 + grid_centers_w[x]) * box_norm
    o4 = (o4 + grid_centers_h[y]) * box_norm
    return o1, o2, o3, o4

def postprocess(outputs, min_confidence, analysis_classes, wh_format=True):
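    """
    Postprocesses the inference output

    Args:
        outputs (list of float): inference output
        min_confidence (float): min confidence to accept detection
        analysis_classes (list of int): indices of the classes to consider
        wh_format (bool): if True, boxes are [xmin, ymin, width, height];
            otherwise [xmin, ymin, xmax, ymax]

    Returns:
        bbs (list of list of int): one box per accepted detection
        class_ids (list of int): class index of each box
        scores (list of float): confidence of each box
    """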

    bbs = []
    class_ids = []
    scores = []
    for c in analysis_classes:

        x1_idx = c * 4 * grid_size
        y1_idx = x1_idx + grid_size
        x2_idx = y1_idx + grid_size
        y2_idx = x2_idx + grid_size

        boxes = outputs[0]
        for h in range(grid_h):
            for w in range(grid_w):
                i = w + h * grid_w
                score = outputs[1][c * grid_size + i]
                if score >= min_confidence:
                    o1 = boxes[x1_idx + w + h * grid_w]
                    o2 = boxes[y1_idx + w + h * grid_w]
                    o3 = boxes[x2_idx + w + h * grid_w]
                    o4 = boxes[y2_idx + w + h * grid_w]

                    o1, o2, o3, o4 = applyBoxNorm(o1, o2, o3, o4, w, h)

                    xmin = int(o1)
                    ymin = int(o2)
                    xmax = int(o3)
                    ymax = int(o4)
                    if wh_format:
                        bbs.append([xmin, ymin, xmax - xmin, ymax - ymin])
                    else:
                        bbs.append([xmin, ymin, xmax, ymax])
                    class_ids.append(c)
                    scores.append(float(score))

    return bbs, class_ids, scores

# TensorRT logger singleton
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_engine_path = os.path.join("YOUR .TRT FILE HERE")

trt_runtime = trt.Runtime(TRT_LOGGER)
trt_engine = load_engine(trt_runtime, trt_engine_path)

# This allocates memory for network inputs/outputs on both CPU and GPU
inputs, outputs, bindings, stream = allocate_buffers(trt_engine)

# Execution context is needed for inference
context = trt_engine.create_execution_context()

image = cv2.imread("YOUR IMAGE HERE")[..., ::-1]

detection_out, keepCount_out = predict(image, model_w, model_h)

NUM_CLASSES = 3  # PeopleNet classes: person, bag, face; change this for your own model
threshold = 0.1
bboxes, class_ids, scores = postprocess(
    [detection_out, keepCount_out], threshold, list(range(NUM_CLASSES))
)

image_cpy = image.copy()
image_cpy = cv2.resize(image_cpy, (model_w, model_h))

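# Keep only the bboxes that survive NMS
indexes = cv2.dnn.NMSBoxes(bboxes, scores, threshold, 0.5)
# Flatten so this works whether OpenCV returns an Nx1 or a flat array of indices
for idx in np.array(indexes).flatten():
    idx = int(idx)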
    xmin, ymin, w, h = bboxes[idx]
    class_id = class_ids[idx]
    color = [255, 0, 0] if class_id else [0, 0, 255]
    cv2.rectangle(image_cpy, (xmin, ymin), (xmin + w, ymin + h), color, 2)
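
The per-cell loop in postprocess() together with applyBoxNorm() can also be written with NumPy broadcasting. The following is only a sketch and not part of the original post: decode_detectnet_v2 is a hypothetical name, and it assumes the same flat output layout that the loop above indexes (bbox values ordered by class, coordinate, row, column; coverage values ordered by class, row, column).

```python
import numpy as np


def decode_detectnet_v2(bbox_flat, cov_flat, num_classes, model_w=960,
                        model_h=544, stride=16, box_norm=35.0,
                        min_confidence=0.1):
    """Vectorized equivalent of the postprocess()/applyBoxNorm() loop (sketch)."""
    grid_w, grid_h = model_w // stride, model_h // stride

    # Normalized grid cell centers, same formula as grid_centers_w / grid_centers_h.
    cx = (np.arange(grid_w) * stride + 0.5) / box_norm
    cy = (np.arange(grid_h) * stride + 0.5) / box_norm

    # Assumed layout: [class][x1, y1, x2, y2][row][col] and [class][row][col].
    bbox = np.asarray(bbox_flat, dtype=np.float32).reshape(
        num_classes, 4, grid_h, grid_w)
    cov = np.asarray(cov_flat, dtype=np.float32).reshape(
        num_classes, grid_h, grid_w)

    # Broadcast the centers over columns (cx) and rows (cy).
    xmin = (bbox[:, 0] - cx[None, None, :]) * -box_norm
    ymin = (bbox[:, 1] - cy[None, :, None]) * -box_norm
    xmax = (bbox[:, 2] + cx[None, None, :]) * box_norm
    ymax = (bbox[:, 3] + cy[None, :, None]) * box_norm

    # Keep only the cells whose coverage reaches the threshold.
    keep = cov >= min_confidence
    class_ids = np.nonzero(keep)[0]
    boxes = np.stack(
        [xmin[keep], ymin[keep], xmax[keep], ymax[keep]], axis=1).astype(int)
    return boxes, class_ids, cov[keep]
```

With the variables from the script above this would be called as decode_detectnet_v2(detection_out, keepCount_out, NUM_CLASSES). Boxes come back as [xmin, ymin, xmax, ymax], so they would need converting to [xmin, ymin, w, h] before being passed to cv2.dnn.NMSBoxes.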


good job!

I was using this code to run inference on a TRT engine built from YOLO, and I got this error:

inputs, outputs, bindings, stream = allocate_buffers(trt_engine)
  File "", line 66, in allocate_buffers
dtype = binding_to_type[str(binding)]
KeyError: 'Input'

Has anyone come across something like this?

Change the input layer name in binding_to_type to the input key shown in the model summary of your exported engine file. You can find this detail in your training log.
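
A way to avoid hard-coding the binding names at all is to ask the engine for them. The sketch below is an assumption on my side, not something from the posts above: binding_dtypes is a hypothetical helper, and it relies on the older binding-based engine API that this thread's code already uses (iterating the engine yields binding names, and get_binding_dtype(name) returns the TensorRT data type). The stub class and its binding names are made up so the example runs without TensorRT installed.

```python
def binding_dtypes(engine, to_numpy_dtype):
    """Build the binding-name -> dtype map from the engine itself.

    `to_numpy_dtype` converts a TensorRT data type to a NumPy dtype;
    with a real engine this would be trt.nptype.
    """
    return {
        name: to_numpy_dtype(engine.get_binding_dtype(name))
        for name in engine
    }


class _StubEngine:
    """Stand-in for trt.ICudaEngine, for illustration only."""

    def __init__(self, dtypes):
        self._dtypes = dtypes

    def __iter__(self):
        return iter(self._dtypes)

    def get_binding_dtype(self, name):
        return self._dtypes[name]


# Made-up binding names and string "dtypes", just to show the shape of the result.
stub = _StubEngine({"Input": "FLOAT", "some_output": "INT32"})
print(binding_dtypes(stub, str.lower))
```

With a real engine the call would be binding_to_type = binding_dtypes(trt_engine, trt.nptype), and the rest of allocate_buffers() could stay as it is.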

I used the reference code above but there is a discrepancy in the output images. How do I get the correct bbox?

What is the purpose of the [..., ::-1] step? If I remove it, I get a different set of results.

By default, OpenCV returns the image in BGR format. Our model, however, expects RGB, so [..., ::-1] reverses the channel order from BGR to RGB.
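
To see the channel reversal on a tiny example, here is a sketch with a made-up two-pixel array (no image file needed):

```python
import numpy as np

# A 1x2 "image" in BGR order: one pure-blue pixel and one pure-red pixel.
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reverse only the last axis (the channels); height and width are untouched.
rgb = bgr[..., ::-1]

print(rgb[0, 0])  # blue pixel, now stored as (R, G, B)
print(rgb[0, 1])  # red pixel, now stored as (R, G, B)
```

The slice returns a view with a negative stride rather than a copy, so wrap it in np.ascontiguousarray(...) if a later call needs contiguous memory.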


@carlos.alvarez Did you run the int8 pruned model on your system? What inference time are you getting compared to the other pruned models provided by NVIDIA?

Sorry, I only ran it with float16 precision and didn’t compare inference times.

Okay, thanks. Do you know what changes need to be made in the code to run the int8 models?

Sorry, I don’t know what the code would look like with int8 models since, AFAIK, you also need a calibration file for those. Check the TensorRT Python API for int8 models to see how to use that file.

@carlos.alvarez Thanks for the code, really helped a lot!
@abhigoku10 You don’t have to make any changes to run the code with int8 models. You’ll have to make changes for batch inference…
I’ve run this on models trained on custom data with TLT: 4 classes, input shape (480, 640, 3), final training for int8 with QAT enabled.
Inference times:
fp32 on Quadro P5000*: batch size 1 = 2.69ms, batch size 32 = 2.2ms
int8 on Jetson AGX Xavier (MAXN): batch size 1 = 5.67ms, batch size 32 = 4.03ms

* I was not able to run fp16 or int8 on this GPU, so I benchmarked with the engine generated after pruned training, before QAT training.
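
On the batch inference point: one possible way to adapt the code is sketched below. This is my own assumption about what the changes could look like, not the code used for the timings above. pack_batch and split_batch are hypothetical helpers; the idea is to allocate buffers with allocate_buffers(engine, batch_size=N), copy N preprocessed images back to back into inputs[0].host, pass batch_size=N to do_inference(), and split each flat output per image. The engine must have been built with a max batch size of at least N.

```python
import numpy as np


def pack_batch(images, host_buffer):
    """Copy preprocessed images back to back into a flat host buffer.

    `images` are arrays of identical size (e.g. what process_image() returns);
    `host_buffer` plays the role of inputs[0].host.
    Returns the number of images written.
    """
    batch = np.stack(images).astype(host_buffer.dtype)
    if batch.size > host_buffer.size:
        raise ValueError("batch does not fit into the allocated host buffer")
    host_buffer[:batch.size] = batch.ravel()
    return len(images)


def split_batch(flat_output, batch_size):
    """View a flat output buffer as one row per image in the batch."""
    return np.asarray(flat_output).reshape(batch_size, -1)
```

Each row returned by split_batch could then be passed to postprocess() in place of the single-image output.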

@Morganh What are the steps to train PeopleNet? I used the following

in which I am not able to pull the docker image or get the “tlt-dataset-convert” executable offline. I am trying to train on an MX130 GPU just to test the training and its results.

Please create a new topic in TLT forum. Thanks.

Great job @m.fiore and @carlos.alvarez, it’s working.