Installing Tao-converter and running: Where is the "Encoding key" of FPEnet?

Hi. I want to convert the FPENet model.tlt to a TensorRT engine and use it in a DeepStream Python environment on a Jetson NX. My PC runs Ubuntu 22.04 LTS 64-bit with a GeForce RTX 2070 and an Intel® Core™ i7-9750H CPU @ 2.60GHz × 12.

I have my NGC API key and a virtual env named “launcher”. TAO is installed in that virtual env and working.

Now I want to install tao-converter. I have downloaded the “cuda113-cudnn80-trt72” build. (My system has CUDA 11.6 and no cuDNN yet, but shouldn't I still be able to install tao-converter?) I have the file downloaded, but even after chmod, the tao-converter file cannot be run and tao-converter cannot be installed.

Is there any way to install tao-converter? (Preferably an easy one.)

$ chmod +x /filepath/to/filename
makes the file executable

to execute it in linux, I had to type $./file -h
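The same step can be scripted, for example from a setup script. This is a minimal sketch; make_executable is a hypothetical helper name, and it sets the execute bit for user, group and others (roughly what chmod a+x does):

```python
import os
import stat


def make_executable(path):
    """Add the execute bit for user, group and others, then report whether
    the current user is allowed to execute the file."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return os.access(path, os.X_OK)
```

After that, the binary can be started with its relative or absolute path, e.g. ./file -h.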

As I wanted to convert the etlt file to a TRT engine and use it on the Jetson, I also had to run the conversion on the Jetson NX and not on the x86 machine (info from the forums)… So tao-converter can run now… But how do I know the encoding key of FPEnet? The NGC API key is not the right one, is it?

The key is: nvidia_tlt
and this seems to be the key for the pretrained etlt models from NGC (a model you train and export yourself uses the key you set at export)…

my command is as follows now:
tao-converter -k nvidia_tlt -t fp32 -p input_face_images:0,1x1x80x80,1x1x80x80,2x1x80x80 -e /models/triton_model_repository/faciallandmarks_tlt/1/model.plan -b 1 /home/eren/FPEnet/model.etlt

but I get this error:
[ERROR] 1: Unexpected exception _Map_base::at
[ERROR] Unable to create engine

Hi,
Are you downloading the correct version of tao-converter for Jetson NX?

Many thanks for your help… Yes, I was using the right version. As I wanted to use the converted TRT engine file on my Jetson NX, I had to convert it on the Jetson itself (this information was not obvious and was difficult to find in the NVIDIA pages).

I checked the JetPack version with “sudo apt-cache show nvidia-jetpack” and downloaded the latest matching one from the “TensorRT — TAO Toolkit 3.22.05 documentation” page.

Eventually I converted the etlt model to trt.engine with the following command:

tao-converter -k nvidia_tlt -t fp16 -p input_face_images:0,1x1x80x80,1x1x80x80,2x1x80x80 -e /target/path/folder -m 1 -w 1000000000 /path/to/etlt_file/to_be_converted/model.etlt

-w was needed for unknown reasons… others in the forum did not need it… I hope that the engine file works.

Please share the full log. Thanks.

I tried again without “-w” and it converted without problems… Thanks… I am still sharing the log… (Is there a better way to share a log?)

$ tao-converter -k nvidia_tlt -t fp16 -p input_face_images:0,1x1x80x80,1x1x80x80,2x1x80x80 -e /home/eren/FPEnet/model.engine -m 1 /home/eren/FPEnet/model.etlt
[INFO] [MemUsageChange] Init CUDA: CPU +363, GPU +0, now: CPU 381, GPU 6718 (MiB)
[INFO] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 381 MiB, GPU 6748 MiB
[INFO] [MemUsageSnapshot] End constructing builder kernel library: CPU 486 MiB, GPU 6858 MiB
[INFO] ----------------------------------------------------------------
[INFO] Input filename: /tmp/file9da4fw
[INFO] ONNX IR version: 0.0.5
[INFO] Opset version: 10
[INFO] Producer name: tf2onnx
[INFO] Producer version: 1.6.3
[INFO] Domain:
[INFO] Model version: 0
[INFO] Doc string:
[INFO] ----------------------------------------------------------------
[WARNING] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[INFO] Detected input dimensions from the model: (-1, 1, 80, 80)
[INFO] Model has dynamic shape. Setting up optimization profiles.
[INFO] Using optimization profile min shape: (1, 1, 80, 80) for input: input_face_images:0
[INFO] Using optimization profile opt shape: (1, 1, 80, 80) for input: input_face_images:0
[INFO] Using optimization profile max shape: (2, 1, 80, 80) for input: input_face_images:0
[WARNING] DLA requests all profiles have same min, max, and opt value. All dla layers are falling back to GPU
[INFO] ---------- Layers Running on DLA ----------
[INFO] ---------- Layers Running on GPU ----------
[INFO] [GpuLayer] row_indexes:0
[INFO] [GpuLayer] column_indexes:0
[INFO] [GpuLayer] block_1a_conv_1/Pad + block_1a_conv_1/BiasAdd + activation_1/Relu
[INFO] [GpuLayer] (Unnamed Layer* 7) [Identity]
[INFO] [GpuLayer] max_pooling2d_1/MaxPool
[INFO] [GpuLayer] block_2a_conv_1/Pad + block_2a_conv_1/BiasAdd + activation_2/Relu
[INFO] [GpuLayer] (Unnamed Layer* 16) [Identity]
[INFO] [GpuLayer] max_pooling2d_2/MaxPool
[INFO] [GpuLayer] block_3a_conv_1/Pad + block_3a_conv_1/BiasAdd + activation_3/Relu
[INFO] [GpuLayer] (Unnamed Layer* 25) [Identity]
[INFO] [GpuLayer] max_pooling2d_3/MaxPool
[INFO] [GpuLayer] block_4a_conv_1/Pad + block_4a_conv_1/BiasAdd + activation_4/Relu
[INFO] [GpuLayer] (Unnamed Layer* 34) [Identity]
[INFO] [GpuLayer] max_pooling2d_4/MaxPool
[INFO] [GpuLayer] block_5a_conv_1/Pad + block_5a_conv_1/BiasAdd + activation_5/Relu
[INFO] [GpuLayer] block_5a_conv_2/Pad + block_5a_conv_2/BiasAdd + activation_6/Relu
[INFO] [GpuLayer] block_5a_conv_3/convolution + activation_7/Relu
[INFO] [GpuLayer] conv2d_transpose_1/conv2d_transpose
[INFO] [GpuLayer] conv2d_transpose_1/conv2d_transpose:0 copy
[INFO] [GpuLayer] block_6a_conv_1/Pad + block_6a_conv_1/BiasAdd + activation_8/Relu
[INFO] [GpuLayer] block_6a_conv_2/convolution + activation_9/Relu
[INFO] [GpuLayer] conv2d_transpose_2/conv2d_transpose
[INFO] [GpuLayer] conv2d_transpose_2/conv2d_transpose:0 copy
[INFO] [GpuLayer] block_7a_conv_1/Pad + block_7a_conv_1/BiasAdd + activation_10/Relu
[INFO] [GpuLayer] block_7a_conv_2/convolution + activation_11/Relu
[INFO] [GpuLayer] conv2d_transpose_3/conv2d_transpose
[INFO] [GpuLayer] conv2d_transpose_3/conv2d_transpose:0 copy
[INFO] [GpuLayer] block_8a_conv_1/Pad + block_8a_conv_1/BiasAdd + activation_12/Relu
[INFO] [GpuLayer] block_8a_conv_2/convolution + activation_13/Relu
[INFO] [GpuLayer] conv2d_transpose_4/conv2d_transpose
[INFO] [GpuLayer] conv2d_transpose_4/conv2d_transpose:0 copy
[INFO] [GpuLayer] block_9a_conv_1/Pad + block_9a_conv_1/BiasAdd + activation_14/Relu
[INFO] [GpuLayer] block_9a_conv_2/convolution + activation_15/Relu
[INFO] [GpuLayer] conv_keypoints_m80/convolution
[INFO] [GpuLayer] softargmax/Max
[INFO] [GpuLayer] softargmax/Max_1
[INFO] [GpuLayer] PWN(PWN(softargmax/sub, softargmax/mul/x:0 + (Unnamed Layer* 415) [Shuffle] + softargmax/mul), softargmax/Exp)
[INFO] [GpuLayer] softargmax/Sum
[INFO] [GpuLayer] softargmax/Sum_1
[INFO] [GpuLayer] softargmax/truediv
[INFO] [GpuLayer] softargmax/mul_2
[INFO] [GpuLayer] softargmax/mul_1
[INFO] [GpuLayer] softargmax/Max_2
[INFO] [GpuLayer] softargmax/Sum_4
[INFO] [GpuLayer] softargmax/Sum_2
[INFO] [GpuLayer] softargmax/Sum_5
[INFO] [GpuLayer] softargmax/Sum_3
[INFO] [GpuLayer] softargmax/Sum_3:0 copy
[INFO] [GpuLayer] softargmax/Sum_5:0 copy
[INFO] [GpuLayer] softargmax/Max_2:0 copy
[INFO] [GpuLayer] softargmax/Squeeze
[INFO] [GpuLayer] softargmax/strided_slice_1
[INFO] [GpuLayer] softargmax/strided_slice
[INFO] [GpuLayer] softargmax/strided_slice_1__242
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +227, GPU +166, now: CPU 716, GPU 7032 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +307, GPU -601, now: CPU 1023, GPU 6431 (MiB)
[INFO] Local timing cache in use. Profiling results in this builder pass will not be stored.
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 2 output network tensors.
[INFO] Total Host Persistent Memory: 37056
[INFO] Total Device Persistent Memory: 1027072
[INFO] Total Scratch Memory: 2048000
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 677 MiB
[INFO] [BlockAssignment] Algorithm ShiftNTopDown took 11.0284ms to assign 7 blocks to 51 nodes requiring 7917056 bytes.
[INFO] Total Activation Memory: 7917056
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +3, now: CPU 1488, GPU 6992 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1488, GPU 6992 (MiB)
[INFO] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +1, GPU +4, now: CPU 1, GPU 4 (MiB)
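
For reference, the -p value in the command above is the input name followed by the min, opt and max shapes, which matches the three “optimization profile” lines in the log. A minimal sketch of assembling the same command in Python (profile_arg and build_command are hypothetical helper names, not part of tao-converter):

```python
def shape_str(shape):
    """Format a shape tuple the way the -p option expects, e.g. 1x1x80x80."""
    return "x".join(str(dim) for dim in shape)


def profile_arg(input_name, min_shape, opt_shape, max_shape):
    """Build the -p value: <input_name>,<min>,<opt>,<max>."""
    return ",".join([input_name, shape_str(min_shape),
                     shape_str(opt_shape), shape_str(max_shape)])


def build_command(etlt_path, engine_path, key, profile,
                  precision="fp16", max_batch=1):
    """Return the argument list for the conversion command used above."""
    return ["tao-converter", "-k", key, "-t", precision, "-p", profile,
            "-e", engine_path, "-m", str(max_batch), etlt_path]


profile = profile_arg("input_face_images:0",
                      (1, 1, 80, 80), (1, 1, 80, 80), (2, 1, 80, 80))
cmd = build_command("/home/eren/FPEnet/model.etlt",
                    "/home/eren/FPEnet/model.engine", "nvidia_tlt", profile)
print(" ".join(cmd))
```

The list form can be handed to subprocess.run(cmd, check=True) on the Jetson, assuming tao-converter is on the PATH.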

Thanks for the info. To share a log, you can also click the “upload” button to attach the log file.

I used the test.py that was posted in the forums previously, but it gave a pycuda memory error. What could be the problem?

eren@erennx:~$ /home/eren/env/bin/python /home/eren/FPEnet/test.py --input facepic.jpg
Traceback (most recent call last):
  File "/home/eren/FPEnet/test.py", line 148, in <module>
    fpenet_obj = FpeNet('/home/eren/FPEnet/model.trt')
  File "/home/eren/FPEnet/test.py", line 35, in __init__
    self._allocate_buffers()
  File "/home/eren/FPEnet/test.py", line 62, in _allocate_buffers
    host_mem = cuda.pagelocked_empty(size, dtype)
pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory
[06/23/2022-20:58:54] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::35] Error Code 1: Cuda Runtime (invalid argument)
Segmentation fault (core dumped)

the code is as below…

import cv2
import numpy as np
import pycuda
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import time

from PIL import Image


class HostDeviceMem(object):
    """Pairs a page-locked host buffer with its device buffer."""

    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


class FpeNet(object):
    def __init__(self, trt_path, input_size=(80, 80), batch_size=1):
        self.trt_path = trt_path
        self.input_size = input_size
        self.batch_size = batch_size

        TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
        trt_runtime = trt.Runtime(TRT_LOGGER)
        self.trt_engine = self._load_engine(trt_runtime, self.trt_path)

        self.inputs, self.outputs, self.bindings, self.stream = \
            self._allocate_buffers()

        self.context = self.trt_engine.create_execution_context()
        self.list_output = None

    def _load_engine(self, trt_runtime, engine_path):
        with open(engine_path, "rb") as f:
            engine_data = f.read()
        engine = trt_runtime.deserialize_cuda_engine(engine_data)
        return engine

    def _allocate_buffers(self):
        inputs = []
        outputs = []
        bindings = []
        stream = cuda.Stream()

        binding_to_type = {
            "input_face_images:0": np.float32,
            "softargmax/strided_slice:0": np.float32,
            "softargmax/strided_slice_1:0": np.float32
        }

        for binding in self.trt_engine:
            size = trt.volume(self.trt_engine.get_binding_shape(binding)) \
                   * self.batch_size
            dtype = binding_to_type[str(binding)]
            # The traceback above points at this allocation.
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            bindings.append(int(device_mem))
            if self.trt_engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))

        return inputs, outputs, bindings, stream

    def _do_inference(self, context, bindings, inputs,
                      outputs, stream):
        # Copy inputs to the device, run the engine, copy outputs back.
        for inp in inputs:
            cuda.memcpy_htod_async(inp.device, inp.host, stream)
        context.execute_async(
            batch_size=self.batch_size, bindings=bindings,
            stream_handle=stream.handle)

        for out in outputs:
            cuda.memcpy_dtoh_async(out.host, out.device, stream)

        stream.synchronize()

        return [out.host for out in outputs]

    def _process_image(self, image):
        image = cv2.imread(image)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        w = self.input_size[0]
        h = self.input_size[1]
        self.image_height = image.shape[0]
        self.image_width = image.shape[1]
        image_resized = Image.fromarray(np.uint8(image))
        image_resized = image_resized.resize(size=(w, h), resample=Image.BILINEAR)
        img_np = np.array(image_resized)
        img_np = img_np.astype(np.float32) #/ 255  #this was corrected in a forum
        img_np = np.expand_dims(img_np, axis=0)  # the shape would be 1x80x80

        return img_np, image

    def predict(self, img_path):
        img_processed, image = self._process_image(img_path)

        np.copyto(self.inputs[0].host, img_processed.ravel())
        t_time = 0
        landmarks = None

        for i in range(1):
            t1 = time.perf_counter()
            landmarks, probs = self._do_inference(
                self.context, bindings=self.bindings, inputs=self.inputs,
                outputs=self.outputs, stream=self.stream)
            t2 = time.perf_counter()
            t_time += (t2 - t1)
        print('inference time:', t_time)

        # to make (x, y)s from the (160, ) output
        landmarks = landmarks.reshape(-1, 2)
        visualized = self._visualize(image, landmarks)

        return visualized

    @staticmethod
    def _postprocess(landmarks):
        landmarks = landmarks.reshape(-1, 2)
        return landmarks

    def _visualize(self, frame, landmarks):
        visualized = cv2.cvtColor(frame, cv2.COLOR_GRAY2BGR)
        for x, y in landmarks:
            # Scale from the 80x80 network input back to the original image.
            x = x * self.image_width / self.input_size[0]
            y = y * self.image_height / self.input_size[1]
            x = int(x)
            y = int(y)
            cv2.circle(visualized, (x, y), 1, (0, 255, 0), 1)
        return visualized


if __name__ == '__main__':
    import argparse

    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument('--input', '-i', type=str, required=True)
    args = arg_parser.parse_args()
    img_path = args.input

    fpenet_obj = FpeNet('/home/eren/FPEnet/model.trt')
    output = fpenet_obj.predict(img_path)
    cv2.imwrite('landmarks.jpg', output)
    print('image has been written to landmarks.jpg')
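
One possible cause, offered as an assumption and not a confirmed diagnosis: the conversion log reports input dimensions (-1, 1, 80, 80), so the shape returned for a binding may contain -1 in the batch dimension. The product of such a shape is negative, and a negative element count passed to cuda.pagelocked_empty could explain the failed allocation. A minimal sketch in plain Python (no TensorRT needed) of sizing the buffer with the dynamic dimension replaced by the maximum batch size of the profile; binding_volume is a hypothetical helper:

```python
def binding_volume(shape, max_batch_size):
    """Element count for a binding shape, treating -1 as the dynamic
    batch dimension and substituting max_batch_size for it."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be positive")
    volume = 1
    for dim in shape:
        volume *= max_batch_size if dim == -1 else dim
    return volume


# What a plain product of the dynamic shape evaluates to.
naive = 1
for dim in (-1, 1, 80, 80):
    naive *= dim
print(naive)  # -6400

# Sized for the max shape (2, 1, 80, 80) used in the conversion command.
print(binding_volume((-1, 1, 80, 80), max_batch_size=2))  # 12800
```

If this is the cause, the size computed in _allocate_buffers would need the same substitution. An engine built with a dynamic shape may also need the input shape set on the execution context before inference, which the script above does not do.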