PeopleNet. Coverage output is always zero

Hi. I’m trying to run inference with the pruned PeopleNet model using TensorRT, but the coverage output is always zero. I downloaded the PeopleNet TLT model using the command from this topic: How to run tlt-converter

After that, I converted the .etlt file to an engine with the following command:

./tlt-converter /home/bronstein/tlt-experiments/resnet34_peoplenet_pruned.etlt -k tlt_encode -o output_cov/Sigmoid,output_bbox/BiasAdd -d 3,544,960 -i nchw -e /home/bronstein/tlt-experiments/engine/peoplenet.engine -m 1 -t fp16

Then I tried to run inference with this model on the image from the PeopleNet main page:

Here is the code I used:

import numpy as np
import cv2
import time

import tensorrt as trt
import pycuda.autoinit
import pycuda.driver as cuda


TRT_LOGGER = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(TRT_LOGGER)

host_inputs  = []
cuda_inputs  = []
host_outputs = []
cuda_outputs = []
bindings = []


def Inference(engine):

    im = cv2.imread('input_11ft45deg_000070.jpg')
    # im = cv2.resize(im, (640, 640))
    im = cv2.resize(im, (960, 544))
    im = np.asarray(im).astype(np.float32)
    im = im.transpose(2,0,1) / 255
    print(np.shape(host_inputs[0]), np.shape(im))
    np.copyto(host_inputs[0], im.ravel())
    stream = cuda.Stream()
    context = engine.create_execution_context()

    start_time = time.time()
    cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
    context.execute_async(bindings=bindings, stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
    stream.synchronize()
    print("execute times "+str(time.time()-start_time))

    print(host_outputs[1], np.max(host_outputs[1]))


def PrepareEngine():
    # deserialize engine
    with open('peoplenet.engine', 'rb') as f:
        buf = f.read()
    engine = runtime.deserialize_cuda_engine(buf)

    # create buffer
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        host_mem = cuda.pagelocked_empty(shape=[size],dtype=np.float32)
        cuda_mem = cuda.mem_alloc(host_mem.nbytes)

        bindings.append(int(cuda_mem))

        if engine.binding_is_input(binding):
            print(engine.get_binding_shape(binding))
            host_inputs.append(host_mem)
            cuda_inputs.append(cuda_mem)
        else:
            print(engine.get_binding_shape(binding))
            host_outputs.append(host_mem)
            cuda_outputs.append(cuda_mem)

    return engine


if __name__ == "__main__":

    engine = PrepareEngine()
    Inference(engine)

And np.max(host_outputs[1]) gives me 0.0. What am I doing wrong?

You can refer to the SSD example: https://github.com/NVIDIA/object-detection-tensorrt-example/blob/master/SSD_Model/utils/inference.py
Note, however, that it is written for SSD; its preprocessing differs from PeopleNet’s (which is actually a detectnet_v2 network).
For postprocessing in Python, see the topic Run PeopleNet with tensorrt.

Hi. Thanks for your answer. I have already seen these topics and my code was written according to them.

Preprocessing the image:
    im = cv2.imread('input_11ft45deg_000070.jpg')
    im = cv2.resize(im, (960, 544))
    im = np.asarray(im).astype(np.float32)
    im = im.transpose(2,0,1) / 255
    print(np.shape(host_inputs[0]), np.shape(im))
    np.copyto(host_inputs[0], im.ravel())

So it’s the same as in your post Run PeopleNet with tensorrt.
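For reference, the preprocessing above can be written as a small NumPy-only helper. This is just a sketch: the function name and the swap_to_rgb flag are made up for illustration, and the idea that the network wants RGB input (cv2.imread returns BGR) is an assumption to check against the model card. A wrong channel order would hurt accuracy but would not by itself give an all-zero coverage output.

```python
import numpy as np


def preprocess(image_bgr, swap_to_rgb=True):
    """Turn an HxWx3 uint8 image (already resized to 960x544) into a flat
    float32 buffer in CHW order, scaled to [0, 1]."""
    image = np.asarray(image_bgr, dtype=np.float32)
    if swap_to_rgb:
        # Assumption: the model expects RGB; cv2.imread delivers BGR.
        image = image[:, :, ::-1]
    chw = image.transpose(2, 0, 1) / 255.0  # HWC -> CHW, scale to [0, 1]
    return np.ascontiguousarray(chw).ravel()
```

The result has 3 * 544 * 960 = 1566720 elements, so it can be passed straight to np.copyto(host_inputs[0], ...).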

Filling host/device buffers:

    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        host_mem = cuda.pagelocked_empty(shape=[size],dtype=np.float32)
        cuda_mem = cuda.mem_alloc(host_mem.nbytes)

        bindings.append(int(cuda_mem))

        if engine.binding_is_input(binding):
            print(engine.get_binding_shape(binding))
            host_inputs.append(host_mem)
            cuda_inputs.append(cuda_mem)
        else:
            print(engine.get_binding_shape(binding))
            host_outputs.append(host_mem)
            cuda_outputs.append(cuda_mem)

So it’s the same as in the SSD example https://github.com/NVIDIA/object-detection-tensorrt-example/blob/master/SSD_Model/utils/engine.py

This is how I’m doing inference:

    stream = cuda.Stream()
    context = engine.create_execution_context()

    start_time = time.time()
    cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
    context.execute_async(bindings=bindings, stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
    stream.synchronize()
    print("execute times "+str(time.time()-start_time))

    print(host_outputs[1], np.max(host_outputs[1]))

It’s also the same as in the SSD example. I really can’t find any difference…

First, please run tlt-infer to check your peoplenet.engine.
The following works on my side.

Inference command:

tlt-infer detectnet_v2 -e infer_spec.txt -k tlt_encode -o infer_result -i input_11ft45deg_000070.jpg

Spec:

inferencer_config{
  # defining target class names for the experiment.
  # Note: This must be mentioned in order of the networks classes.
  target_classes: "Person"
  target_classes: "Bag"
  target_classes: "Face"
  # Inference dimensions.
  image_width: 960
  image_height: 544
  # Must match what the model was trained for.
  image_channels: 3
  batch_size: 1
  gpu_index: 0
  # model handler config
  tensorrt_config{
    trt_engine: "./peoplenet.engine"
  }
}
bbox_handler_config{
  kitti_dump: true
  disable_overlay: false
  overlay_linewidth: 2

  classwise_bbox_handler_config{
    key: "Person"
    value: {
      confidence_model: "aggregate_cov"
      output_map: "Person"
      confidence_threshold: 0.9
      bbox_color{
        R: 0
        G: 255
        B: 0
      }
      clustering_config{
        coverage_threshold: 0.005
        dbscan_eps: 0.3
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }

  classwise_bbox_handler_config{
    key: "Face"
    value: {
      confidence_model: "aggregate_cov"
      output_map: "Face"
      confidence_threshold: 0.9
      bbox_color{
        R: 255
        G: 0
        B: 0
      }
      clustering_config{
        coverage_threshold: 0.005
        dbscan_eps: 0.3
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }

  classwise_bbox_handler_config{
    key: "Bag"
    value: {
      confidence_model: "aggregate_cov"
      output_map: "Bag"
      confidence_threshold: 0.9
      bbox_color{
        R: 0
        G: 0
        B: 255
      }
      clustering_config{
        coverage_threshold: 0.005
        dbscan_eps: 0.3
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
}

Log:

Using TensorFlow backend.
2020-11-24 07:56:01.212334: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-11-24 07:56:03,880 [INFO] iva.detectnet_v2.scripts.inference: Overlain images will be saved in the output path.
2020-11-24 07:56:03,880 [DEBUG] iva.detectnet_v2.inferencer.build_inferencer: Initializing Tensorrt inferencer.
2020-11-24 07:56:03,880 [INFO] iva.detectnet_v2.inferencer.build_inferencer: Constructing inferencer
2020-11-24 07:56:04,364 [INFO] iva.detectnet_v2.inferencer.trt_inferencer: Reading from engine file at: ./peoplenet.engine
[TensorRT] WARNING: Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-11-24 07:56:05,068 [DEBUG] iva.detectnet_v2.inferencer.trt_inferencer: Generated TRT execution context.
2020-11-24 07:56:05,068 [DEBUG] iva.detectnet_v2.inferencer.trt_inferencer: Binding name: input_1, size: 1566720
2020-11-24 07:56:05,070 [DEBUG] iva.detectnet_v2.inferencer.trt_inferencer: Binding name: output_bbox/BiasAdd, size: 24480
2020-11-24 07:56:05,070 [DEBUG] iva.detectnet_v2.inferencer.trt_inferencer: Binding name: output_cov/Sigmoid, size: 6120
2020-11-24 07:56:05,071 [INFO] iva.detectnet_v2.scripts.inference: Initialized model
2020-11-24 07:56:05,071 [INFO] iva.detectnet_v2.scripts.inference: Commencing inference
0%| | 0/1 [00:00<?, ?it/s]2020-11-24 07:56:05,076 [DEBUG] iva.detectnet_v2.scripts.inference: Time lapsed to prepare batch: 0.005053043365478516
2020-11-24 07:56:05,077 [DEBUG] iva.detectnet_v2.inferencer.trt_inferencer: Inferring images
2020-11-24 07:56:05,132 [DEBUG] iva.detectnet_v2.inferencer.trt_inferencer: Number of input blobs 1
2020-11-24 07:56:05,140 [DEBUG] iva.detectnet_v2.inferencer.trt_inferencer: Number of outputs: 2
2020-11-24 07:56:05,140 [DEBUG] iva.detectnet_v2.inferencer.trt_inferencer: Output shape: (24480,), (12, 34, 60)
2020-11-24 07:56:05,140 [DEBUG] iva.detectnet_v2.inferencer.trt_inferencer: Output shape: (6120,), (3, 34, 60)
2020-11-24 07:56:05,140 [DEBUG] iva.detectnet_v2.inferencer.trt_inferencer: Coverage blob shape: (1, 12, 34, 60)
2020-11-24 07:56:05,140 [DEBUG] iva.detectnet_v2.inferencer.trt_inferencer: Inferred_outputs: 2
2020-11-24 07:56:05,140 [DEBUG] iva.detectnet_v2.scripts.inference: Time lapsed to infer batch: 0.06382989883422852
2020-11-24 07:56:05,140 [DEBUG] iva.detectnet_v2.scripts.inference: Preprocessing complete
2020-11-24 07:56:05,141 [DEBUG] iva.detectnet_v2.postprocessor.bbox_handler: Clustering bboxes Person
2020-11-24 07:56:05,141 [DEBUG] iva.detectnet_v2.postprocessor.utilities: Clustering bboxes
2020-11-24 07:56:05,141 [DEBUG] iva.detectnet_v2.postprocessor.utilities: Clustering bboxes using dbscan.
2020-11-24 07:56:05,147 [DEBUG] iva.detectnet_v2.postprocessor.utilities: Number of boxes: 15
2020-11-24 07:56:05,148 [DEBUG] iva.detectnet_v2.postprocessor.bbox_handler: Clustering bboxes Bag
2020-11-24 07:56:05,148 [DEBUG] iva.detectnet_v2.postprocessor.utilities: Clustering bboxes
2020-11-24 07:56:05,148 [DEBUG] iva.detectnet_v2.postprocessor.utilities: Clustering bboxes using dbscan.
2020-11-24 07:56:05,150 [DEBUG] iva.detectnet_v2.postprocessor.utilities: Number of boxes: 16
2020-11-24 07:56:05,151 [DEBUG] iva.detectnet_v2.postprocessor.bbox_handler: Clustering bboxes Face
2020-11-24 07:56:05,151 [DEBUG] iva.detectnet_v2.postprocessor.utilities: Clustering bboxes
2020-11-24 07:56:05,151 [DEBUG] iva.detectnet_v2.postprocessor.utilities: Clustering bboxes using dbscan.
2020-11-24 07:56:05,152 [DEBUG] iva.detectnet_v2.postprocessor.utilities: Number of boxes: 5
2020-11-24 07:56:05,153 [DEBUG] iva.detectnet_v2.scripts.inference: Classwise_detections
2020-11-24 07:56:05,153 [DEBUG] iva.detectnet_v2.scripts.inference: Postprocessing detections: overlaying, metadata and crops.
2020-11-24 07:56:05,661 [DEBUG] iva.detectnet_v2.scripts.inference: Time lapsed: 0.5080950260162354
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.69it/s]
2020-11-24 07:56:05,661 [INFO] iva.detectnet_v2.inferencer.trt_inferencer: Clearing input buffers.
2020-11-24 07:56:05,662 [INFO] iva.detectnet_v2.inferencer.trt_inferencer: Clearing output buffers.
2020-11-24 07:56:05,662 [INFO] iva.detectnet_v2.inferencer.trt_inferencer: Clearing tensorrt runtime.
2020-11-24 07:56:05,662 [INFO] iva.detectnet_v2.inferencer.trt_inferencer: Clearing tensorrt context.
2020-11-24 07:56:05,662 [INFO] iva.detectnet_v2.inferencer.trt_inferencer: Clearing tensorrt engine.
2020-11-24 07:56:05,662 [INFO] iva.detectnet_v2.scripts.inference: Inference complete
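
The binding sizes in the log above can be cross-checked with a little arithmetic, which is handy when sizing buffers by hand. The sketch below is only an illustration: the stride of 16 is inferred from 544/34 and 960/60 in the logged output shapes, the factor of 4 bbox values per class is inferred from 12 = 4 * 3, and the function name is hypothetical.

```python
INPUT_SHAPE = (3, 544, 960)  # C, H, W as passed to tlt-converter with -d
STRIDE = 16                  # assumption: inferred from 544/34 and 960/60
NUM_CLASSES = 3              # Person, Bag, Face


def expected_binding_sizes(input_shape=INPUT_SHAPE, stride=STRIDE,
                           num_classes=NUM_CLASSES):
    """Return the element count of each binding for batch size 1."""
    channels, height, width = input_shape
    grid_h, grid_w = height // stride, width // stride
    return {
        "input_1": channels * height * width,
        # Assumption: 4 box values per class and grid cell.
        "output_bbox/BiasAdd": 4 * num_classes * grid_h * grid_w,
        # One coverage value per class and grid cell.
        "output_cov/Sigmoid": num_classes * grid_h * grid_w,
    }
```

With the defaults this gives 1566720, 24480 and 6120, matching the three "Binding name ... size" lines in the log. Note also that in that log the bbox binding is listed before the coverage binding.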

Thanks for your reply. But how can I do that if I have different versions of TensorRT on the host machine and in the TLT docker container? To convert the engine, I copied the tlt-converter utility to the host machine and ran it there…

I suggest running all the commands in the TLT docker.
BTW, as mentioned above, I generated the TRT engine and ran the inference inside the tlt 2.0_py3 docker.

# cat generate_trt_engine.sh
tlt-converter resnet34_peoplenet_pruned.etlt -k tlt_encode -o output_cov/Sigmoid,output_bbox/BiasAdd -d 3,544,960 -i nchw -e peoplenet.engine -m 1 -t fp16

When I try to use the engine generated inside the TLT container, I get these errors when I try to parse it:

[TensorRT] ERROR: coreReadArchive.cpp (41) - Serialization Error in verifyHeader: 0 (Version tag does not match. Note: Current Version: 96, Serialized Engine Version: 89)
[TensorRT] ERROR: INVALID_STATE: std::exception
[TensorRT] ERROR: INVALID_CONFIG: Deserialize the cuda engine failed.

Do I probably need to downgrade the TensorRT version on my host machine? If so, which version should I choose?
On the host machine I have 7.2.1.6, and in the container 7.0.0-1.

I think you should be able to run tlt-converter and tlt-infer successfully in your TLT docker, because these are the same steps as in the Jupyter notebook.
Below are my steps for your reference.

  1. Log in to the docker
    $ docker run --runtime=nvidia -it -v ~/myfolder:/workspace/tlt-experiments nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3

In the docker, run the following:

  2. Download the jpg and etlt files

wget https://developer.nvidia.com/sites/default/files/akamai/NGC_Images/models/peoplenet/input_11ft45deg_000070.jpg

wget https://api.ngc.nvidia.com/v2/models/nvidia/tlt_peoplenet/versions/pruned_v2.0/files/resnet34_peoplenet_pruned.etlt

  3. Generate the TRT engine

tlt-converter resnet34_peoplenet_pruned.etlt -k tlt_encode -o output_cov/Sigmoid,output_bbox/BiasAdd -d 3,544,960 -i nchw -e peoplenet.engine -m 1 -t fp16

  4. Run inference

tlt-infer detectnet_v2 -e infer_spec.txt -k tlt_encode -o infer_result -i input_11ft45deg_000070.jpg -v

I did as you said. Inference with tlt-infer gives the correct result. But what should I do about inference on the host machine? I can’t use the same engine as inside the TLT docker…

Why do you need to run it again on the host machine? You have actually already run it in the TLT docker on your host machine.

See “-v ~/myfolder:/workspace/tlt-experiments”: if you generate the TRT engine under the folder “/workspace/tlt-experiments”, it is actually available under the folder “myfolder” on the host.
You have already run inference with that TRT engine.

Because I need to use this engine in my application. After I train and prune a model in TLT (for example PeopleNet), I need an engine that works on my host machine. But when I generated the engine using tlt-converter on the host machine and tried to use it, I got incorrect output (see my first post). And when I try to use the engine generated inside the TLT docker on my host machine, I can’t parse it because of an error on this line:

engine = runtime.deserialize_cuda_engine(buf)

[TensorRT] ERROR: coreReadArchive.cpp (41) - Serialization Error in verifyHeader: 0 (Version tag does not match. Note: Current Version: 96, Serialized Engine Version: 89)
[TensorRT] ERROR: INVALID_STATE: std::exception
[TensorRT] ERROR: INVALID_CONFIG: Deserialize the cuda engine failed.
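
A serialized engine has to be loaded by the same TensorRT version that built it, which is what the "Version tag does not match" message is saying. In the TensorRT 7 Python API, a failed deserialize_cuda_engine call logs the error and returns None rather than raising, so a small guard makes the failure explicit. The helper below is a hypothetical sketch (the name load_engine and the message text are mine); it takes the trt.Runtime object as a parameter and does not import tensorrt itself.

```python
def load_engine(runtime, engine_path):
    """Deserialize an engine file and fail loudly if TensorRT rejects it.

    `runtime` is expected to be a trt.Runtime instance (or anything with a
    compatible deserialize_cuda_engine method).
    """
    with open(engine_path, "rb") as f:
        buf = f.read()
    engine = runtime.deserialize_cuda_engine(buf)
    if engine is None:
        # Typical cause: the engine was built with a different TensorRT
        # version (or on a different GPU) than the one loading it.
        raise RuntimeError(
            "Could not deserialize %s; rebuild the engine with the "
            "TensorRT version installed on this machine." % engine_path
        )
    return engine
```

Usage would be engine = load_engine(runtime, 'peoplenet.engine') in place of the open/read/deserialize lines in PrepareEngine().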

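Please modify your code. It copies only host_outputs[0] back from the device, so host_outputs[1], the buffer you print, is never filled. Change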

cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
context.execute_async(bindings=bindings, stream_handle=stream.handle)
cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
stream.synchronize()

to

cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
context.execute_async(bindings=bindings, stream_handle=stream.handle)
cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
cuda.memcpy_dtoh_async(host_outputs[1], cuda_outputs[1], stream)
stream.synchronize()
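
To avoid missing an output like this again, the device-to-host copies can be done in a loop over all output buffers, and the coverage tensor can be picked by its size instead of by index. This is only a sketch: both function names are hypothetical, the copy function is passed in as a parameter (in the real script it would be cuda.memcpy_dtoh_async), and the 3 x 34 x 60 coverage shape is taken from the tlt-infer log earlier in the thread.

```python
import numpy as np


def copy_outputs_to_host(memcpy_dtoh_async, host_outputs, cuda_outputs, stream):
    """Queue a device-to-host copy for every output binding."""
    for host_mem, device_mem in zip(host_outputs, cuda_outputs):
        memcpy_dtoh_async(host_mem, device_mem, stream)


def pick_coverage(host_outputs, num_classes=3, grid=(34, 60)):
    """Return the coverage output reshaped to (classes, grid_h, grid_w).

    The buffer is identified by its element count, so the result does not
    depend on the order in which the engine lists its bindings.
    """
    size = num_classes * grid[0] * grid[1]
    for out in host_outputs:
        if out.size == size:
            return np.asarray(out).reshape(num_classes, *grid)
    raise ValueError("no output buffer with %d elements" % size)
```

In the script above this would be called as copy_outputs_to_host(cuda.memcpy_dtoh_async, host_outputs, cuda_outputs, stream), followed by stream.synchronize() and then pick_coverage(host_outputs).max().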