Hello,
I am trying to run this model here:
I have an NVIDIA Jetson Nano board with the following software versions:
Package: nvidia-jetpack
Version: 4.6-b197
Architecture: arm64
Maintainer: NVIDIA Corporation
Installed-Size: 194
Depends: nvidia-cuda (= 4.6-b197), nvidia-opencv (= 4.6-b197), nvidia-cudnn8 (= 4.6-b197), nvidia-tensorrt (= 4.6-b197), nvidia-visionworks (= 4.6-b197), nvidia-container (= 4.6-b197), nvidia-vpi (= 4.6-b197), nvidia-l4t-jetson-multimedia-api (>> 32.6-0), nvidia-l4t-jetson-multimedia-api (<< 32.7-0)
Homepage: http://developer.nvidia.com/jetson
Priority: standard
******** CUDA DPKG ********
ii cuda-command-line-tools-10-2 10.2.460-1 arm64 CUDA command-line tools
ii cuda-compiler-10-2 10.2.460-1 arm64 CUDA compiler
ii cuda-cudart-10-2 10.2.300-1 arm64 CUDA Runtime native Libraries
ii cuda-cudart-dev-10-2 10.2.300-1 arm64 CUDA Runtime native dev links, headers
ii cuda-cuobjdump-10-2 10.2.300-1 arm64 CUDA cuobjdump
ii cuda-cupti-10-2 10.2.300-1 arm64 CUDA profiling tools runtime libs.
ii cuda-cupti-dev-10-2 10.2.300-1 arm64 CUDA profiling tools interface.
ii cuda-documentation-10-2 10.2.300-1 arm64 CUDA documentation
ii cuda-driver-dev-10-2 10.2.300-1 arm64 CUDA Driver native dev stub library
ii cuda-gdb-10-2 10.2.300-1 arm64 CUDA-GDB
ii cuda-libraries-10-2 10.2.460-1 arm64 CUDA Libraries 10.2 meta-package
ii cuda-libraries-dev-10-2 10.2.460-1 arm64 CUDA Libraries 10.2 development meta-package
ii cuda-memcheck-10-2 10.2.300-1 arm64 CUDA-MEMCHECK
ii cuda-nvcc-10-2 10.2.300-1 arm64 CUDA nvcc
ii cuda-nvdisasm-10-2 10.2.300-1 arm64 CUDA disassembler
ii cuda-nvgraph-10-2 10.2.300-1 arm64 NVGRAPH native runtime libraries
ii cuda-nvgraph-dev-10-2 10.2.300-1 arm64 NVGRAPH native dev links, headers
ii cuda-nvml-dev-10-2 10.2.300-1 arm64 NVML native dev links, headers
ii cuda-nvprof-10-2 10.2.300-1 arm64 CUDA Profiler tools
ii cuda-nvprune-10-2 10.2.300-1 arm64 CUDA nvprune
ii cuda-nvrtc-10-2 10.2.300-1 arm64 NVRTC native runtime libraries
ii cuda-nvrtc-dev-10-2 10.2.300-1 arm64 NVRTC native dev links, headers
ii cuda-nvtx-10-2 10.2.300-1 arm64 NVIDIA Tools Extension
ii cuda-repo-l4t-10-2-local 10.2.460-1 arm64 cuda repository configuration files
ii cuda-samples-10-2 10.2.300-1 arm64 CUDA example applications
ii cuda-toolkit-10-2 10.2.460-1 arm64 CUDA Toolkit 10.2 meta-package
ii cuda-tools-10-2 10.2.460-1 arm64 CUDA Tools meta-package
ii cuda-visual-tools-10-2 10.2.460-1 arm64 CUDA visual tools
ii graphsurgeon-tf 8.0.1-1+cuda10.2 arm64 GraphSurgeon for TensorRT package
ii libcudnn8 8.2.1.32-1+cuda10.2 arm64 cuDNN runtime libraries
ii libcudnn8-dev 8.2.1.32-1+cuda10.2 arm64 cuDNN development libraries and headers
ii libcudnn8-samples 8.2.1.32-1+cuda10.2 arm64 cuDNN documents and samples
ii libnvinfer-bin 8.0.1-1+cuda10.2 arm64 TensorRT binaries
ii libnvinfer-dev 8.0.1-1+cuda10.2 arm64 TensorRT development libraries and headers
ii libnvinfer-doc 8.0.1-1+cuda10.2 all TensorRT documentation
ii libnvinfer-plugin-dev 8.0.1-1+cuda10.2 arm64 TensorRT plugin libraries
ii libnvinfer-plugin8 8.0.1-1+cuda10.2 arm64 TensorRT plugin libraries
ii libnvinfer-samples 8.0.1-1+cuda10.2 all TensorRT samples
ii libnvinfer8 8.0.1-1+cuda10.2 arm64 TensorRT runtime libraries
ii libnvonnxparsers-dev 8.0.1-1+cuda10.2 arm64 TensorRT ONNX libraries
ii libnvonnxparsers8 8.0.1-1+cuda10.2 arm64 TensorRT ONNX libraries
ii libnvparsers-dev 8.0.1-1+cuda10.2 arm64 TensorRT parsers libraries
ii libnvparsers8 8.0.1-1+cuda10.2 arm64 TensorRT parsers libraries
ii nvidia-container-csv-cuda 10.2.460-1 arm64 Jetpack CUDA CSV file
ii nvidia-container-csv-cudnn 8.2.1.32-1+cuda10.2 arm64 Jetpack CUDNN CSV file
ii nvidia-container-csv-tensorrt 8.0.1.6-1+cuda10.2 arm64 Jetpack TensorRT CSV file
ii nvidia-l4t-cuda 32.6.1-20210916211029 arm64 NVIDIA CUDA Package
ii python3-libnvinfer 8.0.1-1+cuda10.2 arm64 Python 3 bindings for TensorRT
ii python3-libnvinfer-dev 8.0.1-1+cuda10.2 arm64 Python 3 development package for TensorRT
ii tensorrt 8.0.1.6-1+cuda10.2 arm64 Meta package of TensorRT
ii uff-converter-tf 8.0.1-1+cuda10.2 arm64 UFF converter for TensorRT package
I converted fpenet.etlt using tao-converter with the following parameters:
export TRT_LIB_PATH="/usr/lib/aarch64-linux-gnu"
export TRT_INC_PATH="/usr/include/aarch64-linux-gnu"
export INPUT_DIMENSIONS=1x1x80x80
export ENCODE_KEY=nvidia_tlt
export BATCH_SIZE=1
export ENGINE_FILE_PATH=/home/e/Desktop/fpenet/fpenet_fp32.engine
export MAX_BATCH_SIZE=1
export OUTPUTS=output_bbox/BiasAdd,output_cov/Sigmoid
export DATA_TYPE=fp32
export MAX_WORKSPACE_SIZE=1610612736
export MODEL_IN=/home/e/Desktop/fpenet/fpenet.etlt
./tao-converter \
-d $INPUT_DIMENSIONS \
-k $ENCODE_KEY \
-b $BATCH_SIZE \
-e $ENGINE_FILE_PATH \
-m $MAX_BATCH_SIZE \
-o $OUTPUTS \
-t $DATA_TYPE \
-w $MAX_WORKSPACE_SIZE \
-p input_face_images,1x1x80x80,1x1x80x80,1x1x80x80 \
$MODEL_IN
I checked the converted model using trtexec and it passed the test as shown below:
./trtexec --verbose --minShapes=input:1x1x80x80 --optShapes=input:1x1x80x80 --maxShapes=input:2x1x80x80 --loadEngine=/home/e/Desktop/fpenet/fpenet_fp32.engine --batch=1
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # ./trtexec --verbose --minShapes=input:1x1x80x80 --optShapes=input:1x1x80x80 --maxShapes=input:2x1x80x80 --loadEngine=/home/e/Desktop/fpenet/fpenet_fp32.engine --batch=1
[11/16/2023-06:46:55] [I] === Model Options ===
[11/16/2023-06:46:55] [I] Format: *
[11/16/2023-06:46:55] [I] Model:
[11/16/2023-06:46:55] [I] Output:
[11/16/2023-06:46:55] [I] === Build Options ===
[11/16/2023-06:46:55] [I] Max batch: explicit
[11/16/2023-06:46:55] [I] Workspace: 16 MiB
[11/16/2023-06:46:55] [I] minTiming: 1
[11/16/2023-06:46:55] [I] avgTiming: 8
[11/16/2023-06:46:55] [I] Precision: FP32
[11/16/2023-06:46:55] [I] Calibration:
[11/16/2023-06:46:55] [I] Refit: Disabled
[11/16/2023-06:46:55] [I] Sparsity: Disabled
[11/16/2023-06:46:55] [I] Safe mode: Disabled
[11/16/2023-06:46:55] [I] Restricted mode: Disabled
[11/16/2023-06:46:55] [I] Save engine:
[11/16/2023-06:46:55] [I] Load engine: /home/e/Desktop/fpenet/fpenet_fp32.engine
[11/16/2023-06:46:55] [I] NVTX verbosity: 0
[11/16/2023-06:46:55] [I] Tactic sources: Using default tactic sources
[11/16/2023-06:46:55] [I] timingCacheMode: local
[11/16/2023-06:46:55] [I] timingCacheFile:
[11/16/2023-06:46:55] [I] Input(s)s format: fp32:CHW
[11/16/2023-06:46:55] [I] Output(s)s format: fp32:CHW
[11/16/2023-06:46:55] [I] Input build shape: input=1x1x80x80+1x1x80x80+2x1x80x80
[11/16/2023-06:46:55] [I] Input calibration shapes: model
[11/16/2023-06:46:55] [I] === System Options ===
[11/16/2023-06:46:55] [I] Device: 0
[11/16/2023-06:46:55] [I] DLACore:
[11/16/2023-06:46:55] [I] Plugins:
[11/16/2023-06:46:55] [I] === Inference Options ===
[11/16/2023-06:46:55] [I] Batch: Explicit
[11/16/2023-06:46:55] [I] Input inference shape: input=1x1x80x80
[11/16/2023-06:46:55] [I] Iterations: 10
[11/16/2023-06:46:55] [I] Duration: 3s (+ 200ms warm up)
[11/16/2023-06:46:55] [I] Sleep time: 0ms
[11/16/2023-06:46:55] [I] Streams: 1
[11/16/2023-06:46:55] [I] ExposeDMA: Disabled
[11/16/2023-06:46:55] [I] Data transfers: Enabled
[11/16/2023-06:46:55] [I] Spin-wait: Disabled
[11/16/2023-06:46:55] [I] Multithreading: Disabled
[11/16/2023-06:46:55] [I] CUDA Graph: Disabled
[11/16/2023-06:46:55] [I] Separate profiling: Disabled
[11/16/2023-06:46:55] [I] Time Deserialize: Disabled
[11/16/2023-06:46:55] [I] Time Refit: Disabled
[11/16/2023-06:46:55] [I] Skip inference: Disabled
[11/16/2023-06:46:55] [I] Inputs:
[11/16/2023-06:46:55] [I] === Reporting Options ===
[11/16/2023-06:46:55] [I] Verbose: Enabled
[11/16/2023-06:46:55] [I] Averages: 10 inferences
[11/16/2023-06:46:55] [I] Percentile: 99
[11/16/2023-06:46:55] [I] Dump refittable layers:Disabled
[11/16/2023-06:46:55] [I] Dump output: Disabled
[11/16/2023-06:46:55] [I] Profile: Disabled
[11/16/2023-06:46:55] [I] Export timing to JSON file:
[11/16/2023-06:46:55] [I] Export output to JSON file:
[11/16/2023-06:46:55] [I] Export profile to JSON file:
[11/16/2023-06:46:55] [I]
[11/16/2023-06:46:55] [I] === Device Information ===
[11/16/2023-06:46:55] [I] Selected Device: NVIDIA Tegra X1
[11/16/2023-06:46:55] [I] Compute Capability: 5.3
[11/16/2023-06:46:55] [I] SMs: 1
[11/16/2023-06:46:55] [I] Compute Clock Rate: 0.9216 GHz
[11/16/2023-06:46:55] [I] Device Global Memory: 1978 MiB
[11/16/2023-06:46:55] [I] Shared Memory per SM: 64 KiB
[11/16/2023-06:46:55] [I] Memory Bus Width: 64 bits (ECC disabled)
[11/16/2023-06:46:55] [I] Memory Clock Rate: 0.01275 GHz
[11/16/2023-06:46:55] [I]
[11/16/2023-06:46:55] [I] TensorRT version: 8001
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::Proposal version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::Split version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[11/16/2023-06:46:55] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[11/16/2023-06:47:05] [I] [TRT] [MemUsageChange] Init CUDA: CPU +203, GPU +0, now: CPU 226, GPU 1894 (MiB)
[11/16/2023-06:47:05] [I] [TRT] Loaded engine size: 4 MB
[11/16/2023-06:47:05] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 226 MiB, GPU 1896 MiB
[11/16/2023-06:47:05] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[11/16/2023-06:47:13] [V] [TRT] Using cublas a tactic source
[11/16/2023-06:47:14] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +158, GPU -8, now: CPU 384, GPU 1893 (MiB)
[11/16/2023-06:47:14] [V] [TRT] Using cuDNN as a tactic source
[11/16/2023-06:47:53] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +241, GPU +52, now: CPU 625, GPU 1945 (MiB)
[11/16/2023-06:48:04] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 624, GPU 1938 (MiB)
[11/16/2023-06:48:04] [V] [TRT] Deserialization required 59070374 microseconds.
[11/16/2023-06:48:04] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 624 MiB, GPU 1938 MiB
[11/16/2023-06:48:05] [I] Engine loaded in 69.7267 sec.
[11/16/2023-06:48:05] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 619 MiB, GPU 1937 MiB
[11/16/2023-06:48:05] [V] [TRT] Using cublas a tactic source
[11/16/2023-06:48:05] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU -6, now: CPU 620, GPU 1931 (MiB)
[11/16/2023-06:48:05] [V] [TRT] Using cuDNN as a tactic source
[11/16/2023-06:48:05] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +1, now: CPU 620, GPU 1932 (MiB)
[11/16/2023-06:48:05] [V] [TRT] Total per-runner device memory is 4277248
[11/16/2023-06:48:05] [V] [TRT] Total per-runner host memory is 28400
[11/16/2023-06:48:05] [V] [TRT] Allocated activation device memory of size 7917056
[11/16/2023-06:48:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 622 MiB, GPU 1903 MiB
[11/16/2023-06:48:19] [I] Created input binding for input_face_images with dimensions 1x1x80x80
[11/16/2023-06:48:19] [I] Created output binding for conv_keypoints_m80 with dimensions 1x80x80x80
[11/16/2023-06:48:19] [I] Created output binding for softargmax with dimensions 1x80x2
[11/16/2023-06:48:19] [I] Created output binding for softargmax:1 with dimensions 1x80
[11/16/2023-06:48:19] [I] Starting inference
[11/16/2023-06:50:50] [I] Warmup completed 1 queries over 200 ms
[11/16/2023-06:50:50] [I] Timing trace has 10 queries over 8.46087 s
[11/16/2023-06:50:50] [I]
[11/16/2023-06:50:50] [I] === Trace details ===
[11/16/2023-06:50:50] [I] Trace averages of 10 runs:
[11/16/2023-06:50:50] [I] Average on 10 runs - GPU latency: 669.589 ms - Host latency: 669.9 ms (end to end 669.975 ms, enqueue 643.141 ms)
[11/16/2023-06:50:50] [I]
[11/16/2023-06:50:50] [I] === Performance summary ===
[11/16/2023-06:50:50] [I] Throughput: 1.18191 qps
[11/16/2023-06:50:50] [I] Latency: min = 14.4062 ms, max = 6501.84 ms, mean = 669.9 ms, median = 20.4219 ms, percentile(99%) = 6501.84 ms
[11/16/2023-06:50:50] [I] End-to-End Host Latency: min = 14.4375 ms, max = 6502.11 ms, mean = 669.975 ms, median = 20.4375 ms, percentile(99%) = 6502.11 ms
[11/16/2023-06:50:50] [I] Enqueue Time: min = 2.25 ms, max = 6407.3 ms, mean = 643.141 ms, median = 2.64844 ms, percentile(99%) = 6407.3 ms
[11/16/2023-06:50:50] [I] H2D Latency: min = 0 ms, max = 0.171875 ms, mean = 0.0265625 ms, median = 0.015625 ms, percentile(99%) = 0.171875 ms
[11/16/2023-06:50:50] [I] GPU Compute Time: min = 14.2031 ms, max = 6501.42 ms, mean = 669.589 ms, median = 20.1406 ms, percentile(99%) = 6501.42 ms
[11/16/2023-06:50:50] [I] D2H Latency: min = 0.203125 ms, max = 0.390625 ms, mean = 0.284375 ms, median = 0.273438 ms, percentile(99%) = 0.390625 ms
[11/16/2023-06:50:50] [I] Total Host Walltime: 8.46087 s
[11/16/2023-06:50:50] [I] Total GPU Compute Time: 6.69589 s
[11/16/2023-06:50:50] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/16/2023-06:50:50] [V]
[11/16/2023-06:50:50] [V] === Explanations of the performance metrics ===
[11/16/2023-06:50:50] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[11/16/2023-06:50:50] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[11/16/2023-06:50:50] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[11/16/2023-06:50:50] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[11/16/2023-06:50:50] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[11/16/2023-06:50:50] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[11/16/2023-06:50:50] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[11/16/2023-06:50:50] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[11/16/2023-06:50:50] [V] End-to-End Host Latency: the duration from when the H2D of a query is called to when the D2H of the same query is completed, which includes the latency to wait for the completion of the previous query. This is the latency of a query if multiple queries are enqueued consecutively.
[11/16/2023-06:50:50] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # ./trtexec --verbose --minShapes=input:1x1x80x80 --optShapes=input:1x1x80x80 --maxShapes=input:2x1x80x80 --loadEngine=/home/e/Desktop/fpenet/fpenet_fp32.engine --batch=1
[11/16/2023-06:50:57] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 862, GPU 1946 (MiB)
If I run inference on the model with the test.py script and the pictures from a previous post (How to do inference with fpenet_fp32.trt), I get an error.
This is the Python code of the script:
import cv2
import numpy as np
import pycuda
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import time
from PIL import Image
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


class FpeNet(object):
    def __init__(self, trt_path, input_size=(80, 80), batch_size=1):
        self.trt_path = trt_path
        self.input_size = input_size
        self.batch_size = batch_size

        TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
        trt_runtime = trt.Runtime(TRT_LOGGER)
        self.trt_engine = self._load_engine(trt_runtime, self.trt_path)

        self.inputs, self.outputs, self.bindings, self.stream = \
            self._allocate_buffers()
        self.context = self.trt_engine.create_execution_context()
        self.list_output = None

    def _load_engine(self, trt_runtime, engine_path):
        with open(engine_path, "rb") as f:
            engine_data = f.read()
        engine = trt_runtime.deserialize_cuda_engine(engine_data)
        return engine

    def _allocate_buffers(self):
        inputs = []
        outputs = []
        bindings = []
        stream = cuda.Stream()

        binding_to_type = {
            "input_face_images:0": np.float32,
            "softargmax/strided_slice:0": np.float32,
            "softargmax/strided_slice_1:0": np.float32
        }

        for binding in self.trt_engine:
            size = trt.volume(self.trt_engine.get_binding_shape(binding)) \
                * self.batch_size
            dtype = binding_to_type[str(binding)]
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            bindings.append(int(device_mem))
            if self.trt_engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))

        return inputs, outputs, bindings, stream

    def _do_inference(self, context, bindings, inputs,
                      outputs, stream):
        [cuda.memcpy_htod_async(inp.device, inp.host, stream) \
            for inp in inputs]

        context.execute_async(
            batch_size=self.batch_size, bindings=bindings,
            stream_handle=stream.handle)

        [cuda.memcpy_dtoh_async(out.host, out.device, stream) \
            for out in outputs]

        stream.synchronize()

        return [out.host for out in outputs]

    def _process_image(self, image):
        image = cv2.imread(image)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        w = self.input_size[0]
        h = self.input_size[1]
        print("w", w)
        print("h", h)
        self.image_height = image.shape[0]
        self.image_width = image.shape[1]
        image_resized = Image.fromarray(np.uint8(image))
        image_resized = image_resized.resize(size=(w, h), resample=Image.BILINEAR)
        img_np = np.array(image_resized)
        img_np = img_np.astype(np.float32)
        img_np = np.expand_dims(img_np, axis=0)  # the shape would be 1x80x80
        return img_np, image

    def predict(self, img_path):
        img_processed, image = self._process_image(img_path)
        np.copyto(self.inputs[0].host, img_processed.ravel())
        t_time = 0
        landmarks = None

        for i in range(1):
            t1 = time.perf_counter()
            landmarks, probs = self._do_inference(
                self.context, bindings=self.bindings, inputs=self.inputs,
                outputs=self.outputs, stream=self.stream)
            t2 = time.perf_counter()
            t_time += (t2 - t1)
        print('inference time:', t_time)

        # to make (x, y)s from the (160, ) output
        landmarks = landmarks.reshape(-1, 2)
        visualized = self._visualize(image, landmarks)

        return visualized

    @staticmethod
    def _postprocess(landmarks):
        landmarks = landmarks.reshape(-1, 2)
        return landmarks

    def _visualize(self, frame, landmarks):
        visualized = cv2.cvtColor(frame, cv2.COLOR_GRAY2BGR)
        for x, y in landmarks:
            x = x * self.image_width / self.input_size[0]
            y = y * self.image_height / self.input_size[1]
            x = int(x)
            y = int(y)
            cv2.circle(visualized, (x, y), 1, (0, 255, 0), 1)
        return visualized


if __name__ == '__main__':
    import argparse

    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument('--input', '-i', type=str, required=True)
    args = arg_parser.parse_args()
    img_path = args.input

    fpenet_obj = FpeNet('/home/e/Desktop/fpenet/fpenet_fp32.engine')
    output = fpenet_obj.predict(img_path)
    cv2.imwrite('landmarks.jpg', output)
    print('image has been written to landmarks.jpg')
This is the exact error output when I run the script:
python3 test.py --input test.png
[TensorRT] WARNING: Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
Traceback (most recent call last):
File "test.py", line 150, in <module>
fpenet_obj = FpeNet('/home/e/Desktop/fpenet/fpenet_fp32.engine')
File "test.py", line 35, in __init__
self._allocate_buffers()
File "test.py", line 61, in _allocate_buffers
dtype = binding_to_type[str(binding)]
KeyError: 'input_face_images'
[TensorRT] INTERNAL ERROR: [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (invalid argument)
Segmentation fault (core dumped)
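My guess is that the KeyError comes from the keys in the binding_to_type dictionary (which still carry the ":0" suffix) not matching the binding names the deserialized engine reports. As a diagnostic, the bindings the engine actually exposes can be dumped with a short snippet like the one below; this is only a minimal sketch, assuming the same TensorRT 8.0 Python API and engine path that test.py already uses:

import pycuda.autoinit  # creates a CUDA context, as in test.py
import tensorrt as trt

ENGINE_PATH = '/home/e/Desktop/fpenet/fpenet_fp32.engine'

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)
with open(ENGINE_PATH, 'rb') as f:
    engine = trt_runtime.deserialize_cuda_engine(f.read())

# Iterating over the engine yields the binding names, exactly as in _allocate_buffers()
for binding in engine:
    kind = 'input ' if engine.binding_is_input(binding) else 'output'
    print(kind, binding,
          engine.get_binding_shape(binding),
          engine.get_binding_dtype(binding))

For comparison, the trtexec log above reports the bindings as input_face_images, conv_keypoints_m80, softargmax, and softargmax:1.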
Does anyone have any idea what went wrong? It would be nice if someone could help me out here.
Kind regards
Emanuel