Inference with Python scripts using .trt TensorRT engines on Jetson TX2

• Hardware: Jetson TX2
• Network Type:
  - EmotionNet
  - HeartRateNet
  - GazeEstimation
• TLT Version: v3.21.11-tf1.15.5
• JetPack: 4.4

I have managed to convert the gaze, EmotionNet, and HeartRateNet models, as well as face detection and facial landmarks, from model.etlt to TensorRT engine (.trt) files.

After the conversion, my idea was to use the .trt engines from Python scripts. I have found a way to load and run inference for FaceDetect and FacialLandmarks, and it works quite well.

I have continued with EmotionNet, but I am still not sure how to preprocess the facial landmarks that serve as input to that model. I tried to normalize them, but I still seem to get all zeros out of the model.

I would be interested in any inference scripts that show how to preprocess the data for these TensorRT models (GazeNet, EmotionNet, HeartRateNet). I have found the Jupyter notebooks, but there is nothing there that helps with understanding what input format the models actually expect.
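In the meantime, the closest I get to understanding what a model expects is printing the engine bindings after deserializing it; a minimal sketch (the .trt path is a placeholder):

import tensorrt as trt

# Minimal sketch: dump every binding of a deserialized engine to see which
# input/output names, shapes and dtypes the .trt model actually expects.
# "emotionnet.trt" is a placeholder path.
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("emotionnet.trt", "rb") as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())

for binding in engine:
    kind = "input" if engine.binding_is_input(binding) else "output"
    print(kind, binding,
          engine.get_binding_shape(binding),
          engine.get_binding_dtype(binding))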

Best regards.

Currently, only the applications below are available for reference.

At the moment I am approaching this with the Python script below, but I still get a vector of zeros from inference on the TensorRT engine.

import time
import itertools

import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates the CUDA context that the pycuda.driver calls below rely on
import pycuda.driver as cuda
import tensorrt as trt


class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()
    
    
class EmotionDetectNet(object):
    def __init__(self, trt_path, batch_size=1):
        self.trt_path = trt_path
        self.batch_size = batch_size

        TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
        trt_runtime = trt.Runtime(TRT_LOGGER)
        self.trt_engine = self._load_engine(trt_runtime, self.trt_path)

        self.inputs, self.outputs, self.bindings, self.stream = \
            self._allocate_buffers()

        self.context = self.trt_engine.create_execution_context()
        self.list_output = None

    def _load_engine(self, trt_runtime, engine_path):
        with open(engine_path, "rb") as f:
            engine_data = f.read()
        engine = trt_runtime.deserialize_cuda_engine(engine_data)
        return engine

    def _allocate_buffers(self):
        inputs = []
        outputs = []
        bindings = []
        stream = cuda.Stream()

        binding_to_type = {
            "input_landmarks:0": np.float32,
            "softmax/Softmax:0": np.float32
        }

        for binding in self.trt_engine:
            print("Binding: {}".format(binding))
            print("Binding shape: {}".format(self.trt_engine.get_binding_shape(binding)))
            # get_binding_shape() reports -1 for a dynamic batch dimension, which
            # makes trt.volume() negative, so take the absolute value before sizing
            # the host/device buffers.
            size = abs(trt.volume(self.trt_engine.get_binding_shape(binding))) \
                   * self.batch_size
            dtype = binding_to_type[str(binding)]
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            bindings.append(int(device_mem))
            if self.trt_engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))

        return inputs, outputs, bindings, stream

    def _do_inference(self, context, bindings, inputs,
                      outputs, stream):
        [cuda.memcpy_htod_async(inp.device, inp.host, stream) \
         for inp in inputs]
        context.execute_async(
            batch_size=self.batch_size, bindings=bindings,
            stream_handle=stream.handle)

        [cuda.memcpy_dtoh_async(out.host, out.device, stream) \
         for out in outputs]

        stream.synchronize()
        return [out.host for out in outputs]
        

    def predict(self, facial_landmarks):
        # facial_landmarks holds one landmark array per detected face.
        # Each array has 80 (x, y) points; the points after index 67 are the
        # eye/pupil centers, and only the classical 68 facial landmark points
        # are used to estimate emotion.
        for landmarks in facial_landmarks:
            
            input_landmarks = np.array(list(itertools.chain(*landmarks)))
            # Keep only the first 68 (x, y) points, flattened into a 136-value vector
            input_landmarks = np.array(input_landmarks[0:136])
            # Min-max normalize to the [0, 1] range
            input_landmarks = (input_landmarks - np.min(input_landmarks)) / (np.max(input_landmarks) - np.min(input_landmarks))
            
            np.copyto(self.inputs[0].host, input_landmarks.ravel())
            t_time = 0
            for i in range(1):
                t1 = time.perf_counter()
                emotion_class = self._do_inference(
                    self.context, bindings=self.bindings, inputs=self.inputs,
                    outputs=self.outputs, stream=self.stream)
                t2 = time.perf_counter()
                t_time += (t2 - t1)
            print('Emotion detect inference time:', t_time)
            
            # _do_inference returns one array per output binding; take the single
            # softmax output
            emotion_class = emotion_class[0]
            
            # Emotion Class order:
            # 0 - Neutral
            # 1 - Happy
            # 2 - Surprise
            # 3 - Squint
            # 4 - Disgust
            # 5 - Scream
            for e in emotion_class:
                print("Emotion: {}".format(e))
           
        return emotion_class
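For completeness, this is roughly how I call the class above; the engine path and the landmark values are placeholders for the output of my FacialLandmarks step:

import numpy as np

# Rough usage sketch of the EmotionDetectNet class above.
# The engine path is a placeholder, and dummy_landmarks stands in for the
# per-face (80, 2) landmark arrays produced by the Facial Landmarks model
# in the 80x80 face-crop coordinates.
emotion_net = EmotionDetectNet("emotionnet.trt", batch_size=1)

dummy_landmarks = [np.random.rand(80, 2).astype(np.float32) * 80.0]
softmax_output = emotion_net.predict(dummy_landmarks)
print("Softmax output:", softmax_output)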

EmotionNet is actually a classification network.
Officially, you can refer to GitHub - NVIDIA-AI-IOT/tao-toolkit-triton-apps: Sample app code for deploying TAO Toolkit trained models to Triton
You can also search for related topics in the TAO forum, for example:

Inferring resnet18 classification etlt model with python - #40 by Morganh
Error while running inference, model generated through TLT using Opencv-Python - #3 by Morganh
TAO tensorRT model inferencing using python
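Since it is a classification network, the engine output is just a softmax vector over the emotion classes. Based on the class order listed in your script, post-processing is a simple argmax; a sketch, assuming that order is correct:

import numpy as np

# Sketch of decoding the softmax output, assuming the class order from the
# script above (Neutral, Happy, Surprise, Squint, Disgust, Scream).
EMOTION_CLASSES = ["Neutral", "Happy", "Surprise", "Squint", "Disgust", "Scream"]

def decode_emotion(softmax_output):
    probs = np.asarray(softmax_output).reshape(-1)[:len(EMOTION_CLASSES)]
    idx = int(np.argmax(probs))
    return EMOTION_CLASSES[idx], float(probs[idx])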

Thank you for providing me with links.

I mean, I have a working example for face detection and facial landmarks, and I understand it is a classification network, but the output should be something other than a vector of zeros. At the moment I am feeding values from the Facial Landmark model (Facial Landmarks Estimation | NVIDIA NGC) as input to the Emotion Detection model. I have also tried the points in their original coordinates inside the 80x80 face box, and the same points scaled to the [0, 1] range. Neither of these approaches seems to give correct results.
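For reference, this is roughly how I prepare the two variants (the helper below is only illustrative; the 80x80 box size comes from the face crop used for the landmarks):

import numpy as np

# Illustrative helper for the two preprocessing variants I tried: raw pixel
# coordinates inside the 80x80 face crop, or the same points scaled to [0, 1].
def prepare_landmarks(landmarks, box_size=80.0, scale_to_unit=True):
    pts = np.asarray(landmarks, dtype=np.float32)[:68]  # keep the 68 classical points, shape (68, 2)
    if scale_to_unit:
        pts = pts / box_size                             # variant 2: coordinates in [0, 1]
    return pts.reshape(-1)                               # flat 136-value vector for the engine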

The input should be a face image. Could you first try to run inference with “tao inference” to check whether your EmotionNet TensorRT engine works?
https://docs.nvidia.com/tao/tao-toolkit/text/emotion_classification/emotion_classification.html#run-inference-on-the-model
