KeyError: 'input_face_images' when inferencing Fpenet on Jetson Nano

Hello,

I am trying to run this model here:

I have an NVIDIA Jetson Nano board with the following software versions:


Package: nvidia-jetpack
Version: 4.6-b197
Architecture: arm64
Maintainer: NVIDIA Corporation
Installed-Size: 194
Depends: nvidia-cuda (= 4.6-b197), nvidia-opencv (= 4.6-b197), nvidia-cudnn8 (= 4.6-b197), nvidia-tensorrt (= 4.6-b197), nvidia-visionworks (= 4.6-b197), nvidia-container (= 4.6-b197), nvidia-vpi (= 4.6-b197), nvidia-l4t-jetson-multimedia-api (>> 32.6-0), nvidia-l4t-jetson-multimedia-api (<< 32.7-0)
Homepage: http://developer.nvidia.com/jetson
Priority: standard
******** CUDA DPKG ********
ii  cuda-command-line-tools-10-2                       10.2.460-1                                 arm64        CUDA command-line tools
ii  cuda-compiler-10-2                                 10.2.460-1                                 arm64        CUDA compiler
ii  cuda-cudart-10-2                                   10.2.300-1                                 arm64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-10-2                               10.2.300-1                                 arm64        CUDA Runtime native dev links, headers
ii  cuda-cuobjdump-10-2                                10.2.300-1                                 arm64        CUDA cuobjdump
ii  cuda-cupti-10-2                                    10.2.300-1                                 arm64        CUDA profiling tools runtime libs.
ii  cuda-cupti-dev-10-2                                10.2.300-1                                 arm64        CUDA profiling tools interface.
ii  cuda-documentation-10-2                            10.2.300-1                                 arm64        CUDA documentation
ii  cuda-driver-dev-10-2                               10.2.300-1                                 arm64        CUDA Driver native dev stub library
ii  cuda-gdb-10-2                                      10.2.300-1                                 arm64        CUDA-GDB
ii  cuda-libraries-10-2                                10.2.460-1                                 arm64        CUDA Libraries 10.2 meta-package
ii  cuda-libraries-dev-10-2                            10.2.460-1                                 arm64        CUDA Libraries 10.2 development meta-package
ii  cuda-memcheck-10-2                                 10.2.300-1                                 arm64        CUDA-MEMCHECK
ii  cuda-nvcc-10-2                                     10.2.300-1                                 arm64        CUDA nvcc
ii  cuda-nvdisasm-10-2                                 10.2.300-1                                 arm64        CUDA disassembler
ii  cuda-nvgraph-10-2                                  10.2.300-1                                 arm64        NVGRAPH native runtime libraries
ii  cuda-nvgraph-dev-10-2                              10.2.300-1                                 arm64        NVGRAPH native dev links, headers
ii  cuda-nvml-dev-10-2                                 10.2.300-1                                 arm64        NVML native dev links, headers
ii  cuda-nvprof-10-2                                   10.2.300-1                                 arm64        CUDA Profiler tools
ii  cuda-nvprune-10-2                                  10.2.300-1                                 arm64        CUDA nvprune
ii  cuda-nvrtc-10-2                                    10.2.300-1                                 arm64        NVRTC native runtime libraries
ii  cuda-nvrtc-dev-10-2                                10.2.300-1                                 arm64        NVRTC native dev links, headers
ii  cuda-nvtx-10-2                                     10.2.300-1                                 arm64        NVIDIA Tools Extension
ii  cuda-repo-l4t-10-2-local                           10.2.460-1                                 arm64        cuda repository configuration files
ii  cuda-samples-10-2                                  10.2.300-1                                 arm64        CUDA example applications
ii  cuda-toolkit-10-2                                  10.2.460-1                                 arm64        CUDA Toolkit 10.2 meta-package
ii  cuda-tools-10-2                                    10.2.460-1                                 arm64        CUDA Tools meta-package
ii  cuda-visual-tools-10-2                             10.2.460-1                                 arm64        CUDA visual tools
ii  graphsurgeon-tf                                    8.0.1-1+cuda10.2                           arm64        GraphSurgeon for TensorRT package
ii  libcudnn8                                          8.2.1.32-1+cuda10.2                        arm64        cuDNN runtime libraries
ii  libcudnn8-dev                                      8.2.1.32-1+cuda10.2                        arm64        cuDNN development libraries and headers
ii  libcudnn8-samples                                  8.2.1.32-1+cuda10.2                        arm64        cuDNN documents and samples
ii  libnvinfer-bin                                     8.0.1-1+cuda10.2                           arm64        TensorRT binaries
ii  libnvinfer-dev                                     8.0.1-1+cuda10.2                           arm64        TensorRT development libraries and headers
ii  libnvinfer-doc                                     8.0.1-1+cuda10.2                           all          TensorRT documentation
ii  libnvinfer-plugin-dev                              8.0.1-1+cuda10.2                           arm64        TensorRT plugin libraries
ii  libnvinfer-plugin8                                 8.0.1-1+cuda10.2                           arm64        TensorRT plugin libraries
ii  libnvinfer-samples                                 8.0.1-1+cuda10.2                           all          TensorRT samples
ii  libnvinfer8                                        8.0.1-1+cuda10.2                           arm64        TensorRT runtime libraries
ii  libnvonnxparsers-dev                               8.0.1-1+cuda10.2                           arm64        TensorRT ONNX libraries
ii  libnvonnxparsers8                                  8.0.1-1+cuda10.2                           arm64        TensorRT ONNX libraries
ii  libnvparsers-dev                                   8.0.1-1+cuda10.2                           arm64        TensorRT parsers libraries
ii  libnvparsers8                                      8.0.1-1+cuda10.2                           arm64        TensorRT parsers libraries
ii  nvidia-container-csv-cuda                          10.2.460-1                                 arm64        Jetpack CUDA CSV file
ii  nvidia-container-csv-cudnn                         8.2.1.32-1+cuda10.2                        arm64        Jetpack CUDNN CSV file
ii  nvidia-container-csv-tensorrt                      8.0.1.6-1+cuda10.2                         arm64        Jetpack TensorRT CSV file
ii  nvidia-l4t-cuda                                    32.6.1-20210916211029                      arm64        NVIDIA CUDA Package
ii  python3-libnvinfer                                 8.0.1-1+cuda10.2                           arm64        Python 3 bindings for TensorRT
ii  python3-libnvinfer-dev                             8.0.1-1+cuda10.2                           arm64        Python 3 development package for TensorRT
ii  tensorrt                                           8.0.1.6-1+cuda10.2                         arm64        Meta package of TensorRT
ii  uff-converter-tf                                   8.0.1-1+cuda10.2                           arm64        UFF converter for TensorRT package

I converted fpenet.etlt using tao-converter with the following parameters:


export TRT_LIB_PATH="/usr/lib/aarch64-linux-gnu"
export TRT_INC_PATH="/usr/include/aarch64-linux-gnu"
export INPUT_DIMENSIONS=1x1x80x80
export ENCODE_KEY=nvidia_tlt
export BATCH_SIZE=1
export ENGINE_FILE_PATH=/home/e/Desktop/fpenet/fpenet_fp32.engine
export MAX_BATCH_SIZE=1
export OUTPUTS=output_bbox/BiasAdd,output_cov/Sigmoid
export DATA_TYPE=fp32
export MAX_WORKSPACE_SIZE=1610612736
export MODEL_IN=/home/e/Desktop/fpenet/fpenet.etlt

./tao-converter \
    -d $INPUT_DIMENSIONS \
    -k $ENCODE_KEY \
    -b $BATCH_SIZE \
    -e $ENGINE_FILE_PATH \
    -m $MAX_BATCH_SIZE \
    -o $OUTPUTS \
    -t $DATA_TYPE \
    -w $MAX_WORKSPACE_SIZE \
    -p input_face_images,1x1x80x80,1x1x80x80,1x1x80x80 \
    $MODEL_IN

I checked the converted model using trtexec and it passed the test as shown below:

./trtexec --verbose  --minShapes=input:1x1x80x80 --optShapes=input:1x1x80x80 --maxShapes=input:2x1x80x80 --loadEngine=/home/e/Desktop/fpenet/fpenet_fp32.engine --batch=1
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # ./trtexec --verbose --minShapes=input:1x1x80x80 --optShapes=input:1x1x80x80 --maxShapes=input:2x1x80x80 --loadEngine=/home/e/Desktop/fpenet/fpenet_fp32.engine --batch=1
[11/16/2023-06:46:55] [I] === Model Options ===
[11/16/2023-06:46:55] [I] Format: *
[11/16/2023-06:46:55] [I] Model: 
[11/16/2023-06:46:55] [I] Output:
[11/16/2023-06:46:55] [I] === Build Options ===
[11/16/2023-06:46:55] [I] Max batch: explicit
[11/16/2023-06:46:55] [I] Workspace: 16 MiB
[11/16/2023-06:46:55] [I] minTiming: 1
[11/16/2023-06:46:55] [I] avgTiming: 8
[11/16/2023-06:46:55] [I] Precision: FP32
[11/16/2023-06:46:55] [I] Calibration: 
[11/16/2023-06:46:55] [I] Refit: Disabled
[11/16/2023-06:46:55] [I] Sparsity: Disabled
[11/16/2023-06:46:55] [I] Safe mode: Disabled
[11/16/2023-06:46:55] [I] Restricted mode: Disabled
[11/16/2023-06:46:55] [I] Save engine: 
[11/16/2023-06:46:55] [I] Load engine: /home/e/Desktop/fpenet/fpenet_fp32.engine
[11/16/2023-06:46:55] [I] NVTX verbosity: 0
[11/16/2023-06:46:55] [I] Tactic sources: Using default tactic sources
[11/16/2023-06:46:55] [I] timingCacheMode: local
[11/16/2023-06:46:55] [I] timingCacheFile: 
[11/16/2023-06:46:55] [I] Input(s)s format: fp32:CHW
[11/16/2023-06:46:55] [I] Output(s)s format: fp32:CHW
[11/16/2023-06:46:55] [I] Input build shape: input=1x1x80x80+1x1x80x80+2x1x80x80
[11/16/2023-06:46:55] [I] Input calibration shapes: model
[11/16/2023-06:46:55] [I] === System Options ===
[11/16/2023-06:46:55] [I] Device: 0
[11/16/2023-06:46:55] [I] DLACore: 
[11/16/2023-06:46:55] [I] Plugins:
[11/16/2023-06:46:55] [I] === Inference Options ===
[11/16/2023-06:46:55] [I] Batch: Explicit
[11/16/2023-06:46:55] [I] Input inference shape: input=1x1x80x80
[11/16/2023-06:46:55] [I] Iterations: 10
[11/16/2023-06:46:55] [I] Duration: 3s (+ 200ms warm up)
[11/16/2023-06:46:55] [I] Sleep time: 0ms
[11/16/2023-06:46:55] [I] Streams: 1
[11/16/2023-06:46:55] [I] ExposeDMA: Disabled
[11/16/2023-06:46:55] [I] Data transfers: Enabled
[11/16/2023-06:46:55] [I] Spin-wait: Disabled
[11/16/2023-06:46:55] [I] Multithreading: Disabled
[11/16/2023-06:46:55] [I] CUDA Graph: Disabled
[11/16/2023-06:46:55] [I] Separate profiling: Disabled
[11/16/2023-06:46:55] [I] Time Deserialize: Disabled
[11/16/2023-06:46:55] [I] Time Refit: Disabled
[11/16/2023-06:46:55] [I] Skip inference: Disabled
[11/16/2023-06:46:55] [I] Inputs:
[11/16/2023-06:46:55] [I] === Reporting Options ===
[11/16/2023-06:46:55] [I] Verbose: Enabled
[11/16/2023-06:46:55] [I] Averages: 10 inferences
[11/16/2023-06:46:55] [I] Percentile: 99
[11/16/2023-06:46:55] [I] Dump refittable layers:Disabled
[11/16/2023-06:46:55] [I] Dump output: Disabled
[11/16/2023-06:46:55] [I] Profile: Disabled
[11/16/2023-06:46:55] [I] Export timing to JSON file: 
[11/16/2023-06:46:55] [I] Export output to JSON file: 
[11/16/2023-06:46:55] [I] Export profile to JSON file: 
[11/16/2023-06:46:55] [I] 
[11/16/2023-06:46:55] [I] === Device Information ===
[11/16/2023-06:46:55] [I] Selected Device: NVIDIA Tegra X1
[11/16/2023-06:46:55] [I] Compute Capability: 5.3
[11/16/2023-06:46:55] [I] SMs: 1
[11/16/2023-06:46:55] [I] Compute Clock Rate: 0.9216 GHz
[11/16/2023-06:46:55] [I] Device Global Memory: 1978 MiB
[11/16/2023-06:46:55] [I] Shared Memory per SM: 64 KiB
[11/16/2023-06:46:55] [I] Memory Bus Width: 64 bits (ECC disabled)
[11/16/2023-06:46:55] [I] Memory Clock Rate: 0.01275 GHz
[11/16/2023-06:46:55] [I] 
[11/16/2023-06:46:55] [I] TensorRT version: 8001
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::Proposal version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::Split version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[11/16/2023-06:47:05] [I] [TRT] [MemUsageChange] Init CUDA: CPU +203, GPU +0, now: CPU 226, GPU 1894 (MiB)
[11/16/2023-06:47:05] [I] [TRT] Loaded engine size: 4 MB
[11/16/2023-06:47:05] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 226 MiB, GPU 1896 MiB
[11/16/2023-06:47:05] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[11/16/2023-06:47:13] [V] [TRT] Using cublas a tactic source
[11/16/2023-06:47:14] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +158, GPU -8, now: CPU 384, GPU 1893 (MiB)
[11/16/2023-06:47:14] [V] [TRT] Using cuDNN as a tactic source
[11/16/2023-06:47:53] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +241, GPU +52, now: CPU 625, GPU 1945 (MiB)
[11/16/2023-06:48:04] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 624, GPU 1938 (MiB)
[11/16/2023-06:48:04] [V] [TRT] Deserialization required 59070374 microseconds.
[11/16/2023-06:48:04] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 624 MiB, GPU 1938 MiB
[11/16/2023-06:48:05] [I] Engine loaded in 69.7267 sec.
[11/16/2023-06:48:05] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 619 MiB, GPU 1937 MiB
[11/16/2023-06:48:05] [V] [TRT] Using cublas a tactic source
[11/16/2023-06:48:05] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU -6, now: CPU 620, GPU 1931 (MiB)
[11/16/2023-06:48:05] [V] [TRT] Using cuDNN as a tactic source
[11/16/2023-06:48:05] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +1, now: CPU 620, GPU 1932 (MiB)
[11/16/2023-06:48:05] [V] [TRT] Total per-runner device memory is 4277248
[11/16/2023-06:48:05] [V] [TRT] Total per-runner host memory is 28400
[11/16/2023-06:48:05] [V] [TRT] Allocated activation device memory of size 7917056
[11/16/2023-06:48:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 622 MiB, GPU 1903 MiB
[11/16/2023-06:48:19] [I] Created input binding for input_face_images with dimensions 1x1x80x80
[11/16/2023-06:48:19] [I] Created output binding for conv_keypoints_m80 with dimensions 1x80x80x80
[11/16/2023-06:48:19] [I] Created output binding for softargmax with dimensions 1x80x2
[11/16/2023-06:48:19] [I] Created output binding for softargmax:1 with dimensions 1x80
[11/16/2023-06:48:19] [I] Starting inference
[11/16/2023-06:50:50] [I] Warmup completed 1 queries over 200 ms
[11/16/2023-06:50:50] [I] Timing trace has 10 queries over 8.46087 s
[11/16/2023-06:50:50] [I] 
[11/16/2023-06:50:50] [I] === Trace details ===
[11/16/2023-06:50:50] [I] Trace averages of 10 runs:
[11/16/2023-06:50:50] [I] Average on 10 runs - GPU latency: 669.589 ms - Host latency: 669.9 ms (end to end 669.975 ms, enqueue 643.141 ms)
[11/16/2023-06:50:50] [I] 
[11/16/2023-06:50:50] [I] === Performance summary ===
[11/16/2023-06:50:50] [I] Throughput: 1.18191 qps
[11/16/2023-06:50:50] [I] Latency: min = 14.4062 ms, max = 6501.84 ms, mean = 669.9 ms, median = 20.4219 ms, percentile(99%) = 6501.84 ms
[11/16/2023-06:50:50] [I] End-to-End Host Latency: min = 14.4375 ms, max = 6502.11 ms, mean = 669.975 ms, median = 20.4375 ms, percentile(99%) = 6502.11 ms
[11/16/2023-06:50:50] [I] Enqueue Time: min = 2.25 ms, max = 6407.3 ms, mean = 643.141 ms, median = 2.64844 ms, percentile(99%) = 6407.3 ms
[11/16/2023-06:50:50] [I] H2D Latency: min = 0 ms, max = 0.171875 ms, mean = 0.0265625 ms, median = 0.015625 ms, percentile(99%) = 0.171875 ms
[11/16/2023-06:50:50] [I] GPU Compute Time: min = 14.2031 ms, max = 6501.42 ms, mean = 669.589 ms, median = 20.1406 ms, percentile(99%) = 6501.42 ms
[11/16/2023-06:50:50] [I] D2H Latency: min = 0.203125 ms, max = 0.390625 ms, mean = 0.284375 ms, median = 0.273438 ms, percentile(99%) = 0.390625 ms
[11/16/2023-06:50:50] [I] Total Host Walltime: 8.46087 s
[11/16/2023-06:50:50] [I] Total GPU Compute Time: 6.69589 s
[11/16/2023-06:50:50] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/16/2023-06:50:50] [V] 
[11/16/2023-06:50:50] [V] === Explanations of the performance metrics ===
[11/16/2023-06:50:50] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[11/16/2023-06:50:50] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[11/16/2023-06:50:50] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[11/16/2023-06:50:50] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[11/16/2023-06:50:50] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[11/16/2023-06:50:50] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[11/16/2023-06:50:50] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[11/16/2023-06:50:50] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[11/16/2023-06:50:50] [V] End-to-End Host Latency: the duration from when the H2D of a query is called to when the D2H of the same query is completed, which includes the latency to wait for the completion of the previous query. This is the latency of a query if multiple queries are enqueued consecutively.
[11/16/2023-06:50:50] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # ./trtexec --verbose --minShapes=input:1x1x80x80 --optShapes=input:1x1x80x80 --maxShapes=input:2x1x80x80 --loadEngine=/home/e/Desktop/fpenet/fpenet_fp32.engine --batch=1
[11/16/2023-06:50:57] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 862, GPU 1946 (MiB)
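
The binding names and shapes that trtexec prints in the log above ("Created input binding for ..." and "Created output binding for ...") are the ones a Python script has to match. As a rough sanity check, the element counts and buffer sizes can be worked out from those shapes. This is only a sketch: the shapes are copied from the log, and the 4-byte element size assumes every binding is fp32.

```python
from functools import reduce
from operator import mul

# Binding names and shapes copied from the trtexec log above.
BINDING_SHAPES = {
    "input_face_images": (1, 1, 80, 80),
    "conv_keypoints_m80": (1, 80, 80, 80),
    "softargmax": (1, 80, 2),
    "softargmax:1": (1, 80),
}

FLOAT32_BYTES = 4  # assumption: every binding is fp32, as in this fp32 engine


def volume(shape):
    """Number of elements in a tensor of the given shape."""
    return reduce(mul, shape, 1)


for name, shape in BINDING_SHAPES.items():
    count = volume(shape)
    print("{}: {} elements, {} bytes".format(name, count, count * FLOAT32_BYTES))
```

The 160 elements of the softargmax binding match the "(160, )" output that test.py reshapes into 80 (x, y) pairs.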

When I run inference on the model with the test.py script and the pictures from a previous post (How to do inference with fpenet_fp32.trt), I get an error.
This is the Python code of the script:

import cv2
import numpy as np
import pycuda
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import time

from PIL import Image


class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


class FpeNet(object):
    def __init__(self, trt_path, input_size=(80, 80), batch_size=1):
        self.trt_path = trt_path
        self.input_size = input_size
        self.batch_size = batch_size

        TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
        trt_runtime = trt.Runtime(TRT_LOGGER)
        self.trt_engine = self._load_engine(trt_runtime, self.trt_path)

        self.inputs, self.outputs, self.bindings, self.stream = \
            self._allocate_buffers()

        self.context = self.trt_engine.create_execution_context()
        self.list_output = None

    def _load_engine(self, trt_runtime, engine_path):
        with open(engine_path, "rb") as f:
            engine_data = f.read()
        engine = trt_runtime.deserialize_cuda_engine(engine_data)
        return engine

    def _allocate_buffers(self):
        inputs = []
        outputs = []
        bindings = []
        stream = cuda.Stream()

        binding_to_type = {
            "input_face_images:0": np.float32,
            "softargmax/strided_slice:0": np.float32,
            "softargmax/strided_slice_1:0": np.float32
        }

        for binding in self.trt_engine:
            size = trt.volume(self.trt_engine.get_binding_shape(binding)) \
                   * self.batch_size
            dtype = binding_to_type[str(binding)]
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            bindings.append(int(device_mem))
            if self.trt_engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))

        return inputs, outputs, bindings, stream

    def _do_inference(self, context, bindings, inputs,
                      outputs, stream):
        [cuda.memcpy_htod_async(inp.device, inp.host, stream) \
         for inp in inputs]
        context.execute_async(
            batch_size=self.batch_size, bindings=bindings,
            stream_handle=stream.handle)

        [cuda.memcpy_dtoh_async(out.host, out.device, stream) \
         for out in outputs]

        stream.synchronize()

        return [out.host for out in outputs]

    def _process_image(self, image):
        image = cv2.imread(image)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        w = self.input_size[0]
        h = self.input_size[1]
        print("w", w)
        print("h", h)
        self.image_height = image.shape[0]
        self.image_width = image.shape[1]
        image_resized = Image.fromarray(np.uint8(image))
        image_resized = image_resized.resize(size=(w, h), resample=Image.BILINEAR)
        img_np = np.array(image_resized)
        img_np = img_np.astype(np.float32)
        img_np = np.expand_dims(img_np, axis=0)  # the shape would be 1x80x80

        return img_np, image

    def predict(self, img_path):
        img_processed, image = self._process_image(img_path)

        np.copyto(self.inputs[0].host, img_processed.ravel())
        t_time = 0
        landmarks = None

        for i in range(1):
            t1 = time.perf_counter()
            landmarks, probs = self._do_inference(
                self.context, bindings=self.bindings, inputs=self.inputs,
                outputs=self.outputs, stream=self.stream)
            t2 = time.perf_counter()
            t_time += (t2 - t1)
        print('inference time:', t_time)

        # to make (x, y)s from the (160, ) output
        landmarks = landmarks.reshape(-1, 2)
        visualized = self._visualize(image, landmarks)

        return visualized

    @staticmethod
    def _postprocess(landmarks):
        landmarks = landmarks.reshape(-1, 2)
        return landmarks

    def _visualize(self, frame, landmarks):
        visualized = cv2.cvtColor(frame, cv2.COLOR_GRAY2BGR)
        for x, y in landmarks:
            x = x * self.image_width / self.input_size[0]
            y = y * self.image_height / self.input_size[1]
            x = int(x)
            y = int(y)
            cv2.circle(visualized, (x, y), 1, (0, 255, 0), 1)
        return visualized


if __name__ == '__main__':
    import argparse

    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument('--input', '-i', type=str, required=True)
    args = arg_parser.parse_args()
    img_path = args.input

    fpenet_obj = FpeNet('/home/e/Desktop/fpenet/fpenet_fp32.engine')
    output = fpenet_obj.predict(img_path)
    cv2.imwrite('landmarks.jpg', output)
    print('image has been written to landmarks.jpg')

This is the exact error message output when I run the script:

python3 test.py --input test.png
[TensorRT] WARNING: Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
Traceback (most recent call last):
  File "test.py", line 150, in <module>
    fpenet_obj = FpeNet('/home/e/Desktop/fpenet/fpenet_fp32.engine')
  File "test.py", line 35, in __init__
    self._allocate_buffers()
  File "test.py", line 61, in _allocate_buffers
    dtype = binding_to_type[str(binding)]
KeyError: 'input_face_images'
[TensorRT] INTERNAL ERROR: [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (invalid argument)

Segmentation fault (core dumped)

Does anyone have any idea what went wrong? It would be nice if someone could help me out here.

Kind regards
Emanuel

You can change
"input_face_images:0": np.float32,
to
"input_face_images": np.float32,

and retry.

More info can be found in https://github.com/NVIDIA/tao_tensorflow1_backend/blob/main/nvidia_tao_tf1/cv/fpenet/inferencer/fpenet_inferencer.py.
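
The KeyError is a plain dictionary miss: the script looks up each binding name in binding_to_type, and none of the hardcoded keys equals a name the engine reports. A minimal sketch of that comparison follows. The engine names are copied from the trtexec log and the script keys from test.py; compare_bindings is a hypothetical helper, not part of TensorRT.

```python
# Names the engine reports, copied from the trtexec log in the question.
ENGINE_BINDINGS = [
    "input_face_images",
    "conv_keypoints_m80",
    "softargmax",
    "softargmax:1",
]

# Keys hardcoded in the original test.py.
SCRIPT_KEYS = [
    "input_face_images:0",
    "softargmax/strided_slice:0",
    "softargmax/strided_slice_1:0",
]


def compare_bindings(engine_names, script_keys):
    """Return (missing, unused).

    missing: engine binding names that have no entry among the script keys.
    unused:  script keys that do not match any engine binding name.
    """
    missing = [name for name in engine_names if name not in script_keys]
    unused = [key for key in script_keys if key not in engine_names]
    return missing, unused


missing, unused = compare_bindings(ENGINE_BINDINGS, SCRIPT_KEYS)
print("bindings without a dtype entry:", missing)
print("dict keys that match no binding:", unused)
```

On the device, the list of engine names could be taken from the loaded engine instead of being typed in by hand, for example with [str(b) for b in engine], since test.py already iterates the engine that way.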

Hello,
Thank you for your quick reply. You were right: the bindings were incorrect.
I changed them from

        binding_to_type = {
            "input_face_images": np.float32,
            "softargmax/strided_slice": np.float32,
            "softargmax/strided_slice_1": np.float32
        }

to

        binding_to_type = {
            "input_face_images": np.float32,
            "conv_keypoints_m80": np.float32,
            "softargmax": np.float32,
            "softargmax:1": np.float32
        }

This solved the problem. Inference runs now.
Thanks Morganh!

Kind regards,
Emanuel
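
A possible way to avoid maintaining binding_to_type by hand is to ask the engine for each binding's dtype. The sketch below is written against the same engine methods test.py already uses (iteration over binding names, get_binding_shape, binding_is_input) plus get_binding_dtype. It assumes the TensorRT 8.0 Python API; describe_bindings and outputs_by_name are hypothetical helper names, and the dtype converter is passed in so the helpers do not import tensorrt themselves.

```python
def describe_bindings(engine, to_numpy_dtype):
    """Collect name, shape, dtype and direction for every binding.

    engine is expected to behave like the TensorRT 8.0 engine in test.py:
    iterating it yields binding names, and it provides get_binding_shape,
    get_binding_dtype and binding_is_input.
    to_numpy_dtype converts the engine's dtype to a NumPy dtype; with
    TensorRT installed, trt.nptype is the intended argument (assumption).
    """
    specs = []
    for binding in engine:
        specs.append({
            "name": str(binding),
            "shape": tuple(engine.get_binding_shape(binding)),
            "dtype": to_numpy_dtype(engine.get_binding_dtype(binding)),
            "is_input": bool(engine.binding_is_input(binding)),
        })
    return specs


def outputs_by_name(specs, host_buffers):
    """Pair each output binding name with its host buffer, in binding order."""
    names = [spec["name"] for spec in specs if not spec["is_input"]]
    if len(names) != len(host_buffers):
        raise ValueError("expected {} output buffers, got {}".format(
            len(names), len(host_buffers)))
    return dict(zip(names, host_buffers))
```

In test.py this could be called as describe_bindings(self.trt_engine, trt.nptype), with spec["dtype"] replacing the binding_to_type lookup. Because this engine has three output bindings, looking results up by name (for example the 1x80x2 softargmax output, which appears to hold the landmark coordinates) is less fragile than unpacking the returned list into two variables.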
