Cannot run inference with FpeNet using TensorRT 8.0

Continuing the discussion from How to do inference with fpenet_fp32.trt:

Please provide the following information when requesting support.

• Hardware: GTX 1070Ti
• Network Type: FpeNet
• TLT Version: tao-converter(tao-converter-x86-tensorrt8.0)
• How to reproduce the issue?
I converted FpeNet to TensorRT with the following steps:

  1. Download the model: Facial Landmarks Estimation | NVIDIA NGC
  2. Convert it to TensorRT with the command: ./tao-converter model.etlt -k nvidia_tlt -p input_face_images:0,1x1x80x80,1x1x80x80,1x1x80x80 -b 1 -t fp32 -e fpenet_b1_fp32_v2.trt
    Output logs:

[INFO] [MemUsageChange] Init CUDA: CPU +160, GPU +0, now: CPU 166, GPU 982 (MiB)
[INFO] ----------------------------------------------------------------
[INFO] Input filename: /tmp/fileHbd8d0
[INFO] ONNX IR version: 0.0.5
[INFO] Opset version: 10
[INFO] Producer name: tf2onnx
[INFO] Producer version: 1.6.3
[INFO] Domain:
[INFO] Model version: 0
[INFO] Doc string:
[INFO] ----------------------------------------------------------------
[WARNING] /trt_oss_src/TensorRT/parsers/onnx/onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[INFO] Detected input dimensions from the model: (-1, 1, 80, 80)
[INFO] Model has dynamic shape. Setting up optimization profiles.
[INFO] Using optimization profile min shape: (1, 1, 80, 80) for input: input_face_images:0
[INFO] Using optimization profile opt shape: (1, 1, 80, 80) for input: input_face_images:0
[INFO] Using optimization profile max shape: (1, 1, 80, 80) for input: input_face_images:0
[INFO] [MemUsageSnapshot] Builder begin: CPU 168 MiB, GPU 982 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +231, GPU +96, now: CPU 399, GPU 1078 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +185, GPU +82, now: CPU 584, GPU 1160 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 2 output network tensors.
[INFO] Total Host Persistent Memory: 29792
[INFO] Total Device Persistent Memory: 2642432
[INFO] Total Scratch Memory: 2048000
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 4 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 773, GPU 1235 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 773, GPU 1243 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 773, GPU 1227 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 772, GPU 1209 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 772 MiB, GPU 1209 MiB

  3. Run test.py: python test.py --input test.png

I got this error:

inferece time: 0.001052992000040831
image has been writen to landmarks.jpg
[TensorRT] ERROR: 1: [hardwareContext.cpp::terminateCommonContext::141] Error Code 1: Cuda Runtime (invalid device context)
[TensorRT] INTERNAL ERROR: [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (invalid argument)
Segmentation fault (core dumped)

I ran everything in this docker image: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3

Thanks,

Please retry with nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3

Hi @Morganh
I got the same issue

root@92ca64683a61:/workspace/tao-experiments/fpenet/test_tensorrt# python test.py --input test.png
inferece time: 0.0009723749990371289
image has been writen to landmarks.jpg
[TensorRT] ERROR: 1: [hardwareContext.cpp::terminateCommonContext::141] Error Code 1: Cuda Runtime (invalid device context)
[TensorRT] INTERNAL ERROR: [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (invalid argument)
Segmentation fault (core dumped)

What is the output of "nvidia-smi" and "dpkg -l | grep cuda"?
And how did you launch this TAO docker container?

Hi @Morganh ,

Did you generate the TRT engine again inside nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3?

Yes, I did.

Here is the code:

import cv2
import numpy as np
import pycuda
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import time

from PIL import Image


class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


class FpeNet(object):
    def __init__(self, trt_path, input_size=(80, 80), batch_size=1):
        self.trt_path = trt_path
        self.input_size = input_size
        self.batch_size = batch_size

        TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
        trt_runtime = trt.Runtime(TRT_LOGGER)
        self.trt_engine = self._load_engine(trt_runtime, self.trt_path)

        self.inputs, self.outputs, self.bindings, self.stream = \
            self._allocate_buffers()

        self.context = self.trt_engine.create_execution_context()
        self.list_output = None

    def _load_engine(self, trt_runtime, engine_path):
        with open(engine_path, "rb") as f:
            engine_data = f.read()
        engine = trt_runtime.deserialize_cuda_engine(engine_data)
        return engine

    def _allocate_buffers(self):
        inputs = []
        outputs = []
        bindings = []
        stream = cuda.Stream()

        binding_to_type = {
            "input_face_images:0": np.float32,
            "softargmax/strided_slice:0": np.float32,
            "softargmax/strided_slice_1:0": np.float32
        }

        for binding in self.trt_engine:
            size = trt.volume(self.trt_engine.get_binding_shape(binding)) \
                   * self.batch_size
            dtype = binding_to_type[str(binding)]
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            bindings.append(int(device_mem))
            if self.trt_engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))

        return inputs, outputs, bindings, stream

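    def _do_inference(self, context, bindings, inputs,
                      outputs, stream):
        # Copy the input buffers from host to device.
        for inp in inputs:
            cuda.memcpy_htod_async(inp.device, inp.host, stream)
        # Queue the inference on the same stream.
        context.execute_async(
            batch_size=self.batch_size, bindings=bindings,
            stream_handle=stream.handle)

        # Copy the output buffers from device back to host.
        for out in outputs:
            cuda.memcpy_dtoh_async(out.host, out.device, stream)

        # Wait until all queued copies and the inference have finished.
        stream.synchronize()

        return [out.host for out in outputs]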

    def _process_image(self, image):
        image = cv2.imread(image)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        w = self.input_size[0]
        h = self.input_size[1]
        self.image_height = image.shape[0]
        self.image_width = image.shape[1]
        image_resized = Image.fromarray(np.uint8(image))
        image_resized = image_resized.resize(size=(w, h), resample=Image.BILINEAR)
        img_np = np.array(image_resized)
        img_np = img_np.astype(np.float32)
        img_np = np.expand_dims(img_np, axis=0)  # the shape would be 1x80x80

        return img_np, image

    def predict(self, img_path):
        img_processed, image = self._process_image(img_path)

        np.copyto(self.inputs[0].host, img_processed.ravel())
        t_time = 0
        landmarks = None

        for i in range(1):
            t1 = time.perf_counter()
            landmarks, probs = self._do_inference(
                self.context, bindings=self.bindings, inputs=self.inputs,
                outputs=self.outputs, stream=self.stream)
            t2 = time.perf_counter()
            t_time += (t2 - t1)
        print('inferece time:', t_time)

        # to make (x, y)s from the (160, ) output
        landmarks = landmarks.reshape(-1, 2)
        visualized = self._visualize(image, landmarks)

        return visualized

    @staticmethod
    def _postprocess(landmarks):
        landmarks = landmarks.reshape(-1, 2)
        return landmarks

    def _visualize(self, frame, landmarks):
        visualized = cv2.cvtColor(frame, cv2.COLOR_GRAY2BGR)
        for x, y in landmarks:
            x = x * self.image_width / self.input_size[0]
            y = y * self.image_height / self.input_size[1]
            x = int(x)
            y = int(y)
            cv2.circle(visualized, (x, y), 1, (0, 255, 0), 1)
        return visualized


if __name__ == '__main__':
    import argparse

    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument('--input', '-i', type=str, required=True)
    args = arg_parser.parse_args()
    img_path = args.input

    fpenet_obj = FpeNet('fpenet_b1_fp32_v3.trt')
    output = fpenet_obj.predict(img_path)
    cv2.imwrite('landmarks.jpg', output)
    print('image has been writen to landmarks.jpg')
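One thing that may be worth checking in _allocate_buffers, offered only as a hedged aside: the converter log reports the model input as (-1, 1, 80, 80). If the engine exposes that dynamic dimension in its binding shapes, a plain product over the shape is negative and cannot serve as a buffer size. The sketch below is plain Python with no TensorRT calls. resolve_dynamic_shape and element_count are hypothetical helper names, and whether this particular engine really reports -1 for its bindings is not shown anywhere in the thread.

```python
def resolve_dynamic_shape(shape, batch_size):
    """Replace each dynamic dimension (-1) with the chosen batch size.

    Assumption: only the batch dimension is dynamic, as in the
    converter log line "(-1, 1, 80, 80)".
    """
    return tuple(batch_size if dim == -1 else dim for dim in shape)


def element_count(shape):
    """Plain product of the dimensions (negative if a -1 is left in)."""
    count = 1
    for dim in shape:
        count *= dim
    return count


# Shape copied from the converter log.
reported = (-1, 1, 80, 80)
resolved = resolve_dynamic_shape(reported, batch_size=1)
print(resolved, element_count(resolved))
```

Printing the binding shapes that the engine actually reports would show whether a substitution like this is needed at all.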

Please try to debug and find which line results in the above error.

Hi @Morganh ,
The landmarks output I get is totally wrong.
I printed the landmarks in the _visualize function (before drawing them on the image):

0 0
0 0
0 0
0 1
1 0
0 0
0 0
0 0
0 2
2 2
0 1
1 1
0 1
1 1
0 1
1 1
1 1
0 1
2 0
1 0
0 1
1 1
1 1
0 2
1 0
1 1
1 1
2 1
0 0
1 0
2 1
1 1
1 0
1 1
1 1
1 1
1 2
1 2
0 0
1 3

Please try to debug and find which line results in the above error.
It happens at the very end of the code. The error is printed after the script has written the output image:

image has been writen to landmarks.jpg
[TensorRT] ERROR: 1: [hardwareContext.cpp::terminateCommonContext::141] Error Code 1: Cuda Runtime (invalid device context)
[TensorRT] INTERNAL ERROR: [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (invalid argument)
Segmentation fault (core dumped)
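A possible explanation, not confirmed anywhere in this thread: the messages appear after the last print, which would be consistent with the TensorRT objects being destroyed at interpreter exit, after pycuda.autoinit has already torn down its CUDA context. If that is what happens, dropping the references explicitly before the script ends might avoid the crash. The helper below is plain Python so it can be tried without a GPU. release_in_order is a hypothetical name, and the attribute names context and trt_engine are taken from the FpeNet class posted above.

```python
import gc


def release_in_order(owner, attr_names):
    """Drop the references stored in owner.<attr>, in the given order.

    Setting an attribute to None removes the owner's reference; if that
    was the last reference, the object is destroyed at that point.
    Returns the names that actually held an object.
    """
    released = []
    for name in attr_names:
        if getattr(owner, name, None) is not None:
            setattr(owner, name, None)
            gc.collect()  # also clean up anything kept alive by cycles
            released.append(name)
    return released
```

With the script above, this would be called as release_in_order(fpenet_obj, ["context", "trt_engine"]) at the very end of the main block. Whether that removes the segmentation fault on this setup is untested.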

Does landmarks.jpg look as expected?

No. All landmarks are in the range [0, 3].
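The observation above can be turned into a small automatic check. This is only a sketch: it assumes the landmarks are expressed in pixels of the 80x80 network input (which is what the scaling in _visualize implies), and both the function name landmark_summary and the 25% span threshold are hypothetical choices, not anything defined by TAO.

```python
def landmark_summary(points, input_size=(80, 80), min_span_fraction=0.25):
    """Report the coordinate ranges and flag a suspiciously small spread.

    points: iterable of (x, y) pairs in network-input pixels.
    min_span_fraction: hypothetical threshold; the result is flagged when
    both the x span and the y span cover less than this fraction of the
    input size.
    """
    pairs = [(float(x), float(y)) for x, y in points]
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    span_x = max(xs) - min(xs)
    span_y = max(ys) - min(ys)
    suspicious = (span_x < min_span_fraction * input_size[0]
                  and span_y < min_span_fraction * input_size[1])
    return {
        "x_range": (min(xs), max(xs)),
        "y_range": (min(ys), max(ys)),
        "suspicious": suspicious,
    }


# A few of the values printed above: everything sits in [0, 3].
print(landmark_summary([(0, 0), (0, 1), (2, 2), (1, 3)]))
```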

To narrow down, please run “tao fpenet inference” to check if it works.
https://docs.nvidia.com/tao/tao-toolkit/text/facial_landmarks_estimation/facial_landmarks_estimation.html#inference-of-the-model

I ran “tao fpenet inference” to check it, with this command:

!tao fpenet inference -e $SPECS_DIR/experiment_spec.yaml \
                      -i $SPECS_DIR/inference_sample.json \
                      -r $LOCAL_PROJECT_DIR \
                      -m $USER_EXPERIMENT_DIR/tao-converter-x86-tensorrt8.0/fpenet_b1_fp32_v3.engine \
                      -o $USER_EXPERIMENT_DIR/tao-converter-x86-tensorrt8.0 \
                      -k $KEY

Here is the error:

2022-03-03 06:47:32,267 [INFO] driveix.common.inferencer.trt_inferencer: Loading TensorRT engine: /workspace/tao-experiments/fpenet/tao-converter-x86-tensorrt8.0/fpenet_b1_fp32_v3.engine
0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/fpenet/scripts/inference.py", line 115, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/fpenet/scripts/inference.py", line 109, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/fpenet/inferencer/fpenet_inferencer.py", line 169, in infer_model
AssertionError: Number of outputs more than 2. Please verify.
Traceback (most recent call last):
  File "/usr/local/bin/fpenet", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/fpenet/entrypoint/fpenet.py", line 12, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/common/entrypoint/entrypoint.py", line 300, in launch_job
AssertionError: Process run failed.
2022-03-03 13:47:33,938 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

This is the command I used to convert the .etlt model to an .engine file:

./tao-converter model.etlt -k nvidia_tlt -p input_face_images:0,1x1x80x80,1x1x80x80,1x1x80x80 -b 1 -t fp32 -e fpenet_b1_fp32_v3.engine

Here is the output of the above command:

[INFO] [MemUsageChange] Init CUDA: CPU +160, GPU +0, now: CPU 166, GPU 1055 (MiB)
[INFO] ----------------------------------------------------------------
[INFO] Input filename: /tmp/fileovkvDS
[INFO] ONNX IR version: 0.0.5
[INFO] Opset version: 10
[INFO] Producer name: tf2onnx
[INFO] Producer version: 1.6.3
[INFO] Domain:
[INFO] Model version: 0
[INFO] Doc string:
[INFO] ----------------------------------------------------------------
[WARNING] /trt_oss_src/TensorRT/parsers/onnx/onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[INFO] Detected input dimensions from the model: (-1, 1, 80, 80)
[INFO] Model has dynamic shape. Setting up optimization profiles.
[INFO] Using optimization profile min shape: (1, 1, 80, 80) for input: input_face_images:0
[INFO] Using optimization profile opt shape: (1, 1, 80, 80) for input: input_face_images:0
[INFO] Using optimization profile max shape: (1, 1, 80, 80) for input: input_face_images:0
[INFO] [MemUsageSnapshot] Builder begin: CPU 167 MiB, GPU 1055 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +231, GPU +96, now: CPU 399, GPU 1151 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +185, GPU +82, now: CPU 584, GPU 1233 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 2 output network tensors.
[INFO] Total Host Persistent Memory: 34032
[INFO] Total Device Persistent Memory: 2232320
[INFO] Total Scratch Memory: 2048000
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 4 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 773, GPU 1292 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 773, GPU 1300 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 773, GPU 1284 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 772, GPU 1266 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 772 MiB, GPU 1266 MiB
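The -p argument in the command above packs the input name together with the min, opt and max shapes, which the log echoes back as the three optimization profile lines. As a rough sketch for double-checking such an argument before running the converter, the string can be split as follows. parse_profile is a hypothetical helper, not part of tao-converter.

```python
def parse_profile(arg):
    """Split '<name>,<min>,<opt>,<max>' into a name and three shape tuples."""
    # Split from the right so that only the last three fields are shapes.
    name, min_s, opt_s, max_s = arg.rsplit(",", 3)

    def to_shape(text):
        return tuple(int(dim) for dim in text.split("x"))

    shapes = {
        "min": to_shape(min_s),
        "opt": to_shape(opt_s),
        "max": to_shape(max_s),
    }
    # Every dimension should satisfy min <= opt <= max.
    ordered = all(
        lo <= mid <= hi
        for lo, mid, hi in zip(shapes["min"], shapes["opt"], shapes["max"])
    )
    return name, shapes, ordered


name, shapes, ordered = parse_profile(
    "input_face_images:0,1x1x80x80,1x1x80x80,1x1x80x80")
print(name, shapes["opt"], ordered)
```

Here all three shapes are identical, so the engine is built for exactly one input shape, 1x1x80x80.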

After checking, the previous docker image works: nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3

Also, for the current version, can you try the official inference path with DeepStream? See the NVIDIA-AI-IOT/deepstream_tao_apps repository on GitHub (sample apps that demonstrate how to deploy models trained with TAO on DeepStream) and deepstream_tao_apps/faciallandmark_sgie_config.txt on its master branch.
