Cannot run inference with FpeNet on TensorRT 8.0

chuongvodoi95 · March 2, 2022, 9:03am

Continuing the discussion from How to do inference with fpenet_fp32.trt:

• Hardware: GTX 1070Ti
• Network Type: FpeNet
• TLT Version: tao-converter (tao-converter-x86-tensorrt8.0)
• How to reproduce the issue?
I converted FpeNet to TensorRT with the following steps:

Download the model: https://catalog.ngc.nvidia.com/orgs/nvidia/models/tlt_fpenet/files?version=deployable_v1.0
Convert it to TensorRT with the command: ./tao-converter model.etlt -k nvidia_tlt -p input_face_images:0,1x1x80x80,1x1x80x80,1x1x80x80 -b 1 -t fp32 -e fpenet_b1_fp32_v2.trt
Output logs:

[INFO] [MemUsageChange] Init CUDA: CPU +160, GPU +0, now: CPU 166, GPU 982 (MiB)
[INFO] ----------------------------------------------------------------
[INFO] Input filename: /tmp/fileHbd8d0
[INFO] ONNX IR version: 0.0.5
[INFO] Opset version: 10
[INFO] Producer name: tf2onnx
[INFO] Producer version: 1.6.3
[INFO] Domain:
[INFO] Model version: 0
[INFO] Doc string:
[INFO] ----------------------------------------------------------------
[WARNING] /trt_oss_src/TensorRT/parsers/onnx/onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[INFO] Detected input dimensions from the model: (-1, 1, 80, 80)
[INFO] Model has dynamic shape. Setting up optimization profiles.
[INFO] Using optimization profile min shape: (1, 1, 80, 80) for input: input_face_images:0
[INFO] Using optimization profile opt shape: (1, 1, 80, 80) for input: input_face_images:0
[INFO] Using optimization profile max shape: (1, 1, 80, 80) for input: input_face_images:0
[INFO] [MemUsageSnapshot] Builder begin: CPU 168 MiB, GPU 982 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +231, GPU +96, now: CPU 399, GPU 1078 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +185, GPU +82, now: CPU 584, GPU 1160 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 2 output network tensors.
[INFO] Total Host Persistent Memory: 29792
[INFO] Total Device Persistent Memory: 2642432
[INFO] Total Scratch Memory: 2048000
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 4 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 773, GPU 1235 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 773, GPU 1243 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 773, GPU 1227 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 772, GPU 1209 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 772 MiB, GPU 1209 MiB
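
The -p argument in the conversion command packs the input name and three shapes into one comma-separated string, and the log above echoes them back as the min, opt and max shapes of the optimization profile. As a minimal sketch in plain Python (not part of tao-converter; parse_profile and shape_in_profile are hypothetical helper names), the string can be split and used to check whether a given input shape falls inside the profile:

```python
def parse_profile(arg):
    """Split '<name>,<min>,<opt>,<max>' into a name and three shape tuples."""
    name, *shapes = arg.split(",")
    if len(shapes) != 3:
        raise ValueError("expected <name>,<min>,<opt>,<max>")
    parsed = tuple(tuple(int(d) for d in s.split("x")) for s in shapes)
    return name, parsed


def shape_in_profile(shape, min_shape, max_shape):
    """True if every dimension of shape lies between the min and max shapes."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= d <= hi for lo, d, hi in zip(min_shape, shape, max_shape))


# The profile string used in the command above.
name, (min_s, opt_s, max_s) = parse_profile(
    "input_face_images:0,1x1x80x80,1x1x80x80,1x1x80x80")
print(name, min_s, opt_s, max_s)
print(shape_in_profile((1, 1, 80, 80), min_s, max_s))  # batch of 1: inside
print(shape_in_profile((2, 1, 80, 80), min_s, max_s))  # batch of 2: outside
```

With min, opt and max all set to 1x1x80x80, only a batch of exactly one 80x80 single-channel image fits this profile.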

Run test.py: python test.py --input test.png

I got this error:

inferece time: 0.001052992000040831
image has been writen to landmarks.jpg
[TensorRT] ERROR: 1: [hardwareContext.cpp::terminateCommonContext::141] Error Code 1: Cuda Runtime (invalid device context)
[TensorRT] INTERNAL ERROR: [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (invalid argument)
Segmentation fault (core dumped)

I run everything inside the docker image nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3

Thanks,

Morganh · March 2, 2022, 9:07am

Please retry with nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3

chuongvodoi95 · March 2, 2022, 9:17am

Hi @Morganh,
I got the same issue:

root@92ca64683a61:/workspace/tao-experiments/fpenet/test_tensorrt# python test.py --input test.png
inferece time: 0.0009723749990371289
image has been writen to landmarks.jpg
[TensorRT] ERROR: 1: [hardwareContext.cpp::terminateCommonContext::141] Error Code 1: Cuda Runtime (invalid device context)
[TensorRT] INTERNAL ERROR: [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (invalid argument)
Segmentation fault (core dumped)

Morganh · March 2, 2022, 9:21am

What is the output of "nvidia-smi" and "dpkg -l | grep cuda"?
And how did you start this TAO docker container?

chuongvodoi95 · March 2, 2022, 9:25am

Hi @Morganh,

nvidia-smi: [screenshot, not reproduced here]

dpkg -l | grep cuda: [screenshot, not reproduced here]

Starting docker: [two screenshots, not reproduced here]

Morganh · March 2, 2022, 9:53am

Did you generate the TensorRT engine again inside nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3?

chuongvodoi95 · March 2, 2022, 9:58am

Yes, I did.

Here is the code:

import cv2
import numpy as np
import pycuda
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import time

from PIL import Image


class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


class FpeNet(object):
    def __init__(self, trt_path, input_size=(80, 80), batch_size=1):
        self.trt_path = trt_path
        self.input_size = input_size
        self.batch_size = batch_size

        TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
        trt_runtime = trt.Runtime(TRT_LOGGER)
        self.trt_engine = self._load_engine(trt_runtime, self.trt_path)

        self.inputs, self.outputs, self.bindings, self.stream = \
            self._allocate_buffers()

        self.context = self.trt_engine.create_execution_context()
        self.list_output = None

    def _load_engine(self, trt_runtime, engine_path):
        with open(engine_path, "rb") as f:
            engine_data = f.read()
        engine = trt_runtime.deserialize_cuda_engine(engine_data)
        return engine

    def _allocate_buffers(self):
        inputs = []
        outputs = []
        bindings = []
        stream = cuda.Stream()

        binding_to_type = {
            "input_face_images:0": np.float32,
            "softargmax/strided_slice:0": np.float32,
            "softargmax/strided_slice_1:0": np.float32
        }

        for binding in self.trt_engine:
            size = trt.volume(self.trt_engine.get_binding_shape(binding)) \
                   * self.batch_size
            dtype = binding_to_type[str(binding)]
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            bindings.append(int(device_mem))
            if self.trt_engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))

        return inputs, outputs, bindings, stream

    def _do_inference(self, context, bindings, inputs,
                      outputs, stream):
        [cuda.memcpy_htod_async(inp.device, inp.host, stream) \
         for inp in inputs]
        context.execute_async(
            batch_size=self.batch_size, bindings=bindings,
            stream_handle=stream.handle)

        [cuda.memcpy_dtoh_async(out.host, out.device, stream) \
         for out in outputs]

        stream.synchronize()

        return [out.host for out in outputs]

    def _process_image(self, image):
        image = cv2.imread(image)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        w = self.input_size[0]
        h = self.input_size[1]
        self.image_height = image.shape[0]
        self.image_width = image.shape[1]
        image_resized = Image.fromarray(np.uint8(image))
        image_resized = image_resized.resize(size=(w, h), resample=Image.BILINEAR)
        img_np = np.array(image_resized)
        img_np = img_np.astype(np.float32)
        img_np = np.expand_dims(img_np, axis=0)  # the shape would be 1x80x80

        return img_np, image

    def predict(self, img_path):
        img_processed, image = self._process_image(img_path)

        np.copyto(self.inputs[0].host, img_processed.ravel())
        t_time = 0
        landmarks = None

        for i in range(1):
            t1 = time.perf_counter()
            landmarks, probs = self._do_inference(
                self.context, bindings=self.bindings, inputs=self.inputs,
                outputs=self.outputs, stream=self.stream)
            t2 = time.perf_counter()
            t_time += (t2 - t1)
        print('inferece time:', t_time)

        # to make (x, y)s from the (160, ) output
        landmarks = landmarks.reshape(-1, 2)
        visualized = self._visualize(image, landmarks)

        return visualized

    @staticmethod
    def _postprocess(landmarks):
        landmarks = landmarks.reshape(-1, 2)
        return landmarks

    def _visualize(self, frame, landmarks):
        visualized = cv2.cvtColor(frame, cv2.COLOR_GRAY2BGR)
        for x, y in landmarks:
            x = x * self.image_width / self.input_size[0]
            y = y * self.image_height / self.input_size[1]
            x = int(x)
            y = int(y)
            cv2.circle(visualized, (x, y), 1, (0, 255, 0), 1)
        return visualized


if __name__ == '__main__':
    import argparse

    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument('--input', '-i', type=str, required=True)
    args = arg_parser.parse_args()
    img_path = args.input

    fpenet_obj = FpeNet('fpenet_b1_fp32_v3.trt')
    output = fpenet_obj.predict(img_path)
    cv2.imwrite('landmarks.jpg', output)
    print('image has been writen to landmarks.jpg')
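
The conversion log reports the model input as (-1, 1, 80, 80). If the deserialized engine also reports -1 for the batch dimension (an assumption; the thread does not show the binding shapes), then multiplying all dimensions together, as _allocate_buffers does, yields a negative number rather than an element count. A minimal sketch in plain Python of resolving the dynamic dimension before sizing a buffer (element_count and naive_count are hypothetical helper names, not TensorRT calls):

```python
def naive_count(shape, batch_size):
    """Product of all dimensions times batch_size, with -1 left as it is."""
    count = batch_size
    for dim in shape:
        count *= dim
    return count


def element_count(shape, batch_size):
    """Product of all dimensions, with each -1 replaced by batch_size."""
    count = 1
    for dim in shape:
        count *= batch_size if dim == -1 else dim
    return count


shape = (-1, 1, 80, 80)  # input shape as printed in the conversion log
print(naive_count(shape, 1))    # -6400
print(element_count(shape, 1))  # 6400
```

Printing the shape of each binding before allocating would show whether this applies to the engine in question.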

Morganh · March 2, 2022, 10:00am

Please try to debug and find which line results in the above error.

chuongvodoi95 · March 2, 2022, 10:06am

Hi @Morganh,
The landmarks I get are completely wrong. I printed them in the _visualize function (before drawing them on the image).

"Please try to debug and find which line results in above error."

It happens at the end of the code. The error is printed after the script has written the output image:

image has been writen to landmarks.jpg
[TensorRT] ERROR: 1: [hardwareContext.cpp::terminateCommonContext::141] Error Code 1: Cuda Runtime (invalid device context)
[TensorRT] INTERNAL ERROR: [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (invalid argument)
Segmentation fault (core dumped)
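
Errors of this kind at interpreter exit are often reported when TensorRT objects outlive the CUDA context that pycuda.autoinit tears down at exit. Whether that is the cause here is not confirmed in this thread. A minimal sketch of dropping the references explicitly before the script ends, using a stand-in class instead of real TensorRT objects (EngineHolder and release_trt_objects are hypothetical names; the attribute names follow the FpeNet class in the code above):

```python
import gc


class EngineHolder:
    """Stand-in for the FpeNet class; the attributes mimic its TensorRT handles."""

    def __init__(self):
        self.context = object()     # would be the execution context
        self.trt_engine = object()  # would be the deserialized engine


def release_trt_objects(holder, names=("context", "trt_engine")):
    """Delete the named attributes in order (context first, then engine)."""
    released = []
    for name in names:
        if hasattr(holder, name):
            delattr(holder, name)
            released.append(name)
    gc.collect()  # encourage immediate destruction of the dropped objects
    return released


holder = EngineHolder()
print(release_trt_objects(holder))  # ['context', 'trt_engine']
```

In test.py this would correspond to calling release_trt_objects(fpenet_obj) as the last statement of the __main__ block, so the objects are destroyed while the CUDA context still exists.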

Morganh · March 2, 2022, 10:14am

Does landmarks.jpg look as expected?

chuongvodoi95 · March 2, 2022, 10:16am

No. All landmarks are in the range [0, 3].
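
For context, the script treats the flat output as (x, y) pairs in the 80x80 input frame and scales them to the original image size, so values confined to [0, 3] all land in the top-left corner. A minimal sketch of that check in plain Python, mirroring the reshape and scaling in predict and _visualize (the helper names and the sample values are made up for illustration):

```python
def to_pairs(flat):
    """Group a flat [x0, y0, x1, y1, ...] sequence into (x, y) pairs."""
    if len(flat) % 2:
        raise ValueError("expected an even number of values")
    return [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


def extent(pairs):
    """Return ((min_x, max_x), (min_y, max_y)) over all pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return (min(xs), max(xs)), (min(ys), max(ys))


def scale_to_image(pairs, input_size, image_size):
    """Scale pairs from network input coordinates to image pixel coordinates."""
    in_w, in_h = input_size
    img_w, img_h = image_size
    return [(int(x * img_w / in_w), int(y * img_h / in_h)) for x, y in pairs]


flat = [0.5, 1.0, 2.5, 3.0, 1.5, 0.25]  # made-up values in [0, 3]
pairs = to_pairs(flat)
print(extent(pairs))                               # ((0.5, 2.5), (0.25, 3.0))
print(scale_to_image(pairs, (80, 80), (160, 240)))  # [(1, 3), (5, 9), (3, 0)]
```

An extent far smaller than the 80x80 input, as here, suggests the problem lies in how the engine is fed or read rather than in the drawing code.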

Morganh · March 2, 2022, 10:31am

To narrow down, please run “tao fpenet inference” to check if it works.
https://docs.nvidia.com/tao/tao-toolkit/text/facial_landmarks_estimation/facial_landmarks_estimation.html#inference-of-the-model

chuongvodoi95 · March 3, 2022, 7:56am

I ran “tao fpenet inference” to check it, using this command:

!tao fpenet inference -e $SPECS_DIR/experiment_spec.yaml \
                      -i $SPECS_DIR/inference_sample.json \
                      -r $LOCAL_PROJECT_DIR \
                      -m $USER_EXPERIMENT_DIR/tao-converter-x86-tensorrt8.0/fpenet_b1_fp32_v3.engine \
                      -o $USER_EXPERIMENT_DIR/tao-converter-x86-tensorrt8.0 \
                      -k $KEY

Here is the error:

2022-03-03 06:47:32,267 [INFO] driveix.common.inferencer.trt_inferencer: Loading TensorRT engine: /workspace/tao-experiments/fpenet/tao-converter-x86-tensorrt8.0/fpenet_b1_fp32_v3.engine
0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/fpenet/scripts/inference.py", line 115, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/fpenet/scripts/inference.py", line 109, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/fpenet/inferencer/fpenet_inferencer.py", line 169, in infer_model
AssertionError: Number of outputs more than 2. Please verify.
Traceback (most recent call last):
  File "/usr/local/bin/fpenet", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/fpenet/entrypoint/fpenet.py", line 12, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/common/entrypoint/entrypoint.py", line 300, in launch_job
AssertionError: Process run failed.
2022-03-03 13:47:33,938 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

This is the command I used to convert the .etlt model to an .engine file:

./tao-converter model.etlt -k nvidia_tlt -p input_face_images:0,1x1x80x80,1x1x80x80,1x1x80x80 -b 1 -t fp32 -e fpenet_b1_fp32_v3.engine

Here is the output of the above command:

[INFO] [MemUsageChange] Init CUDA: CPU +160, GPU +0, now: CPU 166, GPU 1055 (MiB)
[INFO] ----------------------------------------------------------------
[INFO] Input filename: /tmp/fileovkvDS
[INFO] ONNX IR version: 0.0.5
[INFO] Opset version: 10
[INFO] Producer name: tf2onnx
[INFO] Producer version: 1.6.3
[INFO] Domain:
[INFO] Model version: 0
[INFO] Doc string:
[INFO] ----------------------------------------------------------------
[WARNING] /trt_oss_src/TensorRT/parsers/onnx/onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[INFO] Detected input dimensions from the model: (-1, 1, 80, 80)
[INFO] Model has dynamic shape. Setting up optimization profiles.
[INFO] Using optimization profile min shape: (1, 1, 80, 80) for input: input_face_images:0
[INFO] Using optimization profile opt shape: (1, 1, 80, 80) for input: input_face_images:0
[INFO] Using optimization profile max shape: (1, 1, 80, 80) for input: input_face_images:0
[INFO] [MemUsageSnapshot] Builder begin: CPU 167 MiB, GPU 1055 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +231, GPU +96, now: CPU 399, GPU 1151 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +185, GPU +82, now: CPU 584, GPU 1233 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 2 output network tensors.
[INFO] Total Host Persistent Memory: 34032
[INFO] Total Device Persistent Memory: 2232320
[INFO] Total Scratch Memory: 2048000
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 4 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 773, GPU 1292 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 773, GPU 1300 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 773, GPU 1284 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 772, GPU 1266 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 772 MiB, GPU 1266 MiB

Morganh · March 3, 2022, 4:05pm

After checking, the previous docker image works: nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3

Also, for the current version, can you run inference the official way, with DeepStream? See the NVIDIA-AI-IOT/deepstream_tao_apps repository on GitHub (sample apps to demonstrate how to deploy models trained with TAO on DeepStream) and deepstream_tao_apps/faciallandmark_sgie_config.txt on its master branch.

system · March 17, 2022, 4:06pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
