TAO-converted .plan model running in Triton server gives poor accuracy

Oh, very sorry for that, I have not tried your model yet.

As mentioned above, could you try to run the standalone script as well?
See Inferring resnet18 classification etlt model with python - #9 by Morganh
and Inferring resnet18 classification etlt model with python - #40 by Morganh

I just need to docker exec -it xxxx bash into my Triton docker (triton-apps:21.11-py3), copy in the Python samples from TensorRT Python Samples, and then run inference against my classification model, correct? No need to clone the whole TensorRT source to build and install.

No, just log in to the TAO docker and try to run the standalone Python script below. There is no need to copy other Python samples or the TRT source.
Inferring resnet18 classification etlt model with python - #9 by Morganh
or Inferring resnet18 classification etlt model with python - #40 by Morganh

My training machine and Triton server are two different standalone machines; the poor accuracy happens on the Triton server.

Are those Python scripts (Inferring resnet18 classification etlt model with python - #9 by Morganh) supposed to run on my training machine? This is the docker ps on my training machine:

CONTAINER ID   IMAGE                                                     COMMAND                  CREATED       STATUS       PORTS     NAMES
47f180137f6c   nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3   "install_ngc_cli.sh …"   13 days ago   Up 13 days             dazzling_carver
2abe608d35af   nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3   "install_ngc_cli.sh …"   13 days ago   Up 13 days             gallant_hawking
f3b53fb42965   nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3   "install_ngc_cli.sh …"   13 days ago   Up 13 days             reverent_dijkstra

After docker exec -it 47f180137f6c bash, there is only a bin folder there:

root@47f180137f6c:/usr/src/tensorrt# ls
bin

The above-mentioned standalone scripts can run on either of your machines. You can run them as below:
$ python xxx.py

The script will run inference against a TensorRT engine. The TensorRT engine has already been generated by the Triton server, so you can log in to the Triton server directly and run the standalone inference script there.
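
For reference, the core of that standalone script is just deserializing the model.plan with the TensorRT Python runtime. A minimal sketch (the engine path below is a placeholder, adjust it to your model repository layout):

# Minimal sketch: deserialize a Triton-generated model.plan with the TensorRT Python runtime.
# The engine path is a placeholder; point it at your own model repository layout.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
engine_path = "/model_repository/<your_model>/1/model.plan"  # placeholder path

with open(engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# deserialize_cuda_engine() returns None when the runtime cannot read the engine,
# for example when it was built with a different TensorRT version.
print("engine loaded:", engine is not None)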

I prefer to run it in the Triton server docker. This is the docker ps on my Triton server:

CONTAINER ID   IMAGE                                      COMMAND                  CREATED      STATUS      PORTS                                                           NAMES
a014d704c8e9   nvcr.io/nvidia/tao/triton-apps:21.11-py3   "/opt/tritonserver/n…"   5 days ago   Up 5 days   0.0.0.0:8000-8002->8000-8002/tcp, :::8000-8002->8000-8002/tcp   stupefied_grothendieck

I entered the docker with sudo docker exec -it a014d704c8e9 bash and checked the following:

  • /usr/src
    contains two folders: cudnn_samples_v8 and tensorrt

  • /usr/src/tensorrt
    contains one folder: bin

Then I manually copied the single file caffe_resnet50.py into /usr/src/tensorrt and ran python3 caffe_resnet50.py; it gave an error:

Traceback (most recent call last):
  File "caffe_resnet50.py", line 26, in <module>
    import tensorrt as trt
ModuleNotFoundError: No module named 'tensorrt'

PS: that post was using a .trt engine file, while I only have the .etlt model.

In the Triton server, the TRT engine has already been generated for you. You can find your model (model.plan); that is the TensorRT engine.
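
If you are not sure where it lives, a quick way to list the .plan engines under the model repository (assuming the default /model_repository mount used by tao-toolkit-triton-apps) is something like:

# Minimal sketch: list the TensorRT engines (model.plan files) in the Triton model repository.
# /model_repository is the mount used by the tao-toolkit-triton-apps container; adjust if yours differs.
import os

for root, _dirs, files in os.walk("/model_repository"):
    for name in files:
        if name.endswith(".plan"):
            print(os.path.join(root, name))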

I did docker exec -it xxx bash into the docker instance of tao-toolkit-triton-apps, and after installing a bunch of dependencies (nvidia-tensorrt, opencv-python, libgl1, pycuda, pillow), I ran the Python script infer_cls.py with small modifications from here. The script is now:

import os
import time
import cv2
#import matplotlib.pyplot as plt
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
from PIL import Image

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def load_engine(trt_runtime, engine_path):
    with open(engine_path, "rb") as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

def allocate_buffers(engine):
    # Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
    h_output = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(trt.float32))
    # Allocate device memory for inputs and outputs.
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()
    return h_input, d_input, h_output, d_output, stream

def load_normalized_test_case(test_image, pagelocked_buffer):
    # Converts the input image to a CHW Numpy array
    def normalize_image(image):
        # Resize, antialias and transpose the image to CHW.
        c, h, w = 3,120,120
        return np.asarray(image.resize((w, h), Image.ANTIALIAS)).transpose([2, 0, 1]).astype(trt.nptype(trt.float32)).ravel()

    # Normalize the image and copy to pagelocked memory.
    np.copyto(pagelocked_buffer, normalize_image(Image.open(test_image)))
    return test_image

def do_inference(context, h_input, d_input, h_output, d_output, stream):
    # Transfer input data to the GPU.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    # Run inference.
    context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    # Synchronize the stream
    stream.synchronize()
    return h_output,h_input

if __name__ == '__main__':    
    neg = 0
    pos = 0
    count = 0
    
    # TensorRT logger singleton
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    trt_engine_path = os.path.join("electric_bicycle_net_tao/1/model.plan")
    if not os.path.exists(trt_engine_path):
        print("the engine file does not exist, quit!")
        exit()
    trt_runtime = trt.Runtime(TRT_LOGGER)
    trt_engine = load_engine(trt_runtime, trt_engine_path)
    
    # This allocates memory for network inputs/outputs on both CPU and GPU
    h_input, d_input, h_output, d_output, stream = allocate_buffers(trt_engine)
    
    # Execution context is needed for inference
    context = trt_engine.create_execution_context()


    # -------------- MODEL PARAMETERS FOR THE MODEL --------------------------------
    model_h = 120
    model_w = 120
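    # Note: the resize dimensions here (and the 3,120,120 hard-coded in normalize_image above)
    # must match the input shape of the TensorRT engine, i.e. the -d value passed to tao-converter.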
    img_dir = "data/"
    
    folders = os.listdir(img_dir)
    
    for sub_folder in folders:
        #loop over the folders
        images = os.listdir(img_dir + sub_folder)
        
        for i in images:
            #loop over the images
            
            test_image = img_dir + sub_folder + "/" + i
            
            labels_file = "electric_bicycle_net_tao/labels.txt"
            labels = open(labels_file, 'r').read().split('\n')
            
            test_case = load_normalized_test_case(test_image, h_input)
            
            start_time = time.time()
            h_output,h_input = do_inference(context, h_input, d_input, h_output, d_output, stream)
            pred = labels[np.argmax(h_output)]
            
            #print (test_image)
            print ("class: ",pred,", Confidence: ", max(h_output))
            print ("Inference Time : ",time.time()-start_time)
            
            if pred == "negative":
                neg +=1
            if pred == "positive":
                pos+=1
                print (test_image)
                #time.sleep(3)
                
            #time.sleep(1)
            count += 1
    
    print ("Total Number of items in the directory : ",count)
    print ("Total number of Positive Items : ",pos)
    print ("Total number of Negative Items : ",neg)

it shows an error:

root@9207ab950ed0:/opt/tritonserver/mytest# python3 infer_cls.py
[03/24/2022-03:17:41] [TRT] [E] 1: [stdArchiveReader.cpp::StdArchiveReader::35] Error Code 1: Serialization (Serialization assertion safeVersionRead == safeSerializationVersion failed.Version tag does not match. Note: Current Version: 0, Serialized Engine Version: 43)
[03/24/2022-03:17:41] [TRT] [E] 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
Traceback (most recent call last):
  File "infer_cls.py", line 85, in <module>
    h_input, d_input, h_output, d_output, stream = allocate_buffers(trt_engine)
  File "infer_cls.py", line 34, in allocate_buffers
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
AttributeError: 'NoneType' object has no attribute 'get_binding_shape'

This is my folder structure:

root@9207ab950ed0:/opt/tritonserver/mytest# ls
data  electric_bicycle_net_tao  infer_cls.py
root@9207ab950ed0:/opt/tritonserver/mytest# ls electric_bicycle_net_tao/1/     
model.plan

Could you help?

That means the existing TensorRT engine does not match your environment.
You can refer to the command below to generate the TensorRT engine again.

tao-converter /tao_models/vehicletypenet_model/resnet18_vehicletypenet_pruned.etlt
-k tlt_encode
-c /tao_models/vehicletypenet_model/vehicletypenet_int8.txt
-d 3,224,224
-o predictions/Softmax
-t int8
-m 16
-e /model_repository/vehicletypenet_tao/1/model.plan

I think the above model was actually generated from the tao-toolkit-triton-apps repo, as I cloned tao-toolkit-triton-apps and modified tao-toolkit-triton-apps/download_and_convert.sh a bit:

echo "Converting the Electric_bicycle_net_tao model"
mkdir -p /model_repository/electric_bicycle_net_tao/1
tao-converter /tao_models/electric_bicycle_net_tao/final_model.etlt \
              -k nvidia_tlt \
              -d 3,224,224 \
              -o predictions/Softmax \
              -m 16 \
              -e /model_repository/electric_bicycle_net_tao/1/model.plan

At the start of the Triton server, I can see the model was correctly converted:


...
...
...
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2565, GPU 1702 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 2546 MiB, GPU 1702 MiB
Converting the Electric_bicycle_net_tao model
[INFO] [MemUsageChange] Init CUDA: CPU +534, GPU +0, now: CPU 540, GPU 560 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 629 MiB, GPU 560 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +791, GPU +340, now: CPU 1464, GPU 900 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +195, GPU +342, now: CPU 1659, GPU 1242 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 1 output network tensors.
[INFO] Total Host Persistent Memory: 94864
[INFO] Total Device Persistent Memory: 46283264
[INFO] Total Scratch Memory: 0
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 53 MiB, GPU 32 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2634, GPU 1768 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2634, GPU 1776 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2634, GPU 1760 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2634, GPU 1742 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 2634 MiB, GPU 1742 MiB
I0323 11:48:38.459937 64 metrics.cc:298] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3060
I0323 11:48:38.625626 64 libtorch.cc:1092] TRITONBACKEND_Initialize: pytorch
I0323 11:48:38.625643 64 libtorch.cc:1102] Triton TRITONBACKEND API version: 1.6
I0323 11:48:38.625646 64 libtorch.cc:1108] 'pytorch' TRITONBACKEND API version: 1.6
2022-03-23 11:48:38.818313: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0323 11:48:38.844581 64 tensorflow.cc:2170] TRITONBACKEND_Initialize: tensorflow
I0323 11:48:38.844603 64 tensorflow.cc:2180] Triton TRITONBACKEND API version: 1.6
I0323 11:48:38.844606 64 tensorflow.cc:2186] 'tensorflow' TRITONBACKEND API version: 1.6
I0323 11:48:38.844609 64 tensorflow.cc:2210] backend configuration:
{}
I0323 11:48:38.845561 64 onnxruntime.cc:1999] TRITONBACKEND_Initialize: onnxruntime
I0323 11:48:38.845572 64 onnxruntime.cc:2009] Triton TRITONBACKEND API version: 1.6
I0323 11:48:38.845575 64 onnxruntime.cc:2015] 'onnxruntime' TRITONBACKEND API version: 1.6
I0323 11:48:38.876770 64 openvino.cc:1193] TRITONBACKEND_Initialize: openvino
I0323 11:48:38.876789 64 openvino.cc:1203] Triton TRITONBACKEND API version: 1.6
I0323 11:48:38.876792 64 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.6
I0323 11:48:39.002462 64 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fd9e6000000' with size 268435456
I0323 11:48:39.002599 64 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0323 11:48:39.003531 64 model_repository_manager.cc:1022] loading: vehicletypenet_tao:1
I0323 11:48:39.103978 64 model_repository_manager.cc:1022] loading: electric_bicycle_net_tao:1
I0323 11:48:39.133371 64 tensorrt.cc:4925] TRITONBACKEND_Initialize: tensorrt
I0323 11:48:39.133394 64 tensorrt.cc:4935] Triton TRITONBACKEND API version: 1.6
I0323 11:48:39.133398 64 tensorrt.cc:4941] 'tensorrt' TRITONBACKEND API version: 1.6
I0323 11:48:39.133477 64 tensorrt.cc:4984] backend configuration:
{}
I0323 11:48:39.133680 64 tensorrt.cc:5036] TRITONBACKEND_ModelInitialize: vehicletypenet_tao (version 1)
I0323 11:48:39.135022 64 tensorrt.cc:5085] TRITONBACKEND_ModelInstanceInitialize: vehicletypenet_tao (GPU device 0)
I0323 11:48:39.509504 64 logging.cc:49] [MemUsageChange] Init CUDA: CPU +525, GPU +0, now: CPU 648, GPU 624 (MiB)
I0323 11:48:39.513867 64 logging.cc:49] Loaded engine size: 5 MB
I0323 11:48:39.513959 64 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 659 MiB, GPU 624 MiB
I0323 11:48:40.022610 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +792, GPU +340, now: CPU 1451, GPU 970 (MiB)
I0323 11:48:40.441824 64 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +195, GPU +336, now: CPU 1646, GPU 1306 (MiB)
I0323 11:48:40.442715 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1646, GPU 1288 (MiB)
I0323 11:48:40.442756 64 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1646 MiB, GPU 1288 MiB
I0323 11:48:40.442897 64 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1635 MiB, GPU 1288 MiB
I0323 11:48:40.443158 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1635, GPU 1298 (MiB)
I0323 11:48:40.443830 64 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1635, GPU 1306 (MiB)
I0323 11:48:40.444192 64 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation end: CPU 1635 MiB, GPU 1326 MiB
I0323 11:48:40.444274 64 tensorrt.cc:1379] Created instance vehicletypenet_tao on GPU 0 with stream priority 0
I0323 11:48:40.444292 64 tensorrt.cc:5036] TRITONBACKEND_ModelInitialize: electric_bicycle_net_tao (version 1)
I0323 11:48:40.444393 64 model_repository_manager.cc:1183] successfully loaded 'vehicletypenet_tao' version 1
I0323 11:48:40.445171 64 tensorrt.cc:5085] TRITONBACKEND_ModelInstanceInitialize: electric_bicycle_net_tao (GPU device 0)
I0323 11:48:40.445384 64 logging.cc:49] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1635, GPU 1326 (MiB)
I0323 11:48:40.475368 64 logging.cc:49] Loaded engine size: 44 MB
I0323 11:48:40.475466 64 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 1724 MiB, GPU 1326 MiB
I0323 11:48:40.569055 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1724, GPU 1386 (MiB)
I0323 11:48:40.569464 64 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1724, GPU 1396 (MiB)
I0323 11:48:40.569926 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1724, GPU 1380 (MiB)
I0323 11:48:40.569967 64 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1724 MiB, GPU 1380 MiB
I0323 11:48:40.572716 64 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1635 MiB, GPU 1380 MiB
I0323 11:48:40.572975 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 1636, GPU 1388 (MiB)
I0323 11:48:40.573242 64 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1636, GPU 1396 (MiB)
I0323 11:48:40.573912 64 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation end: CPU 1636 MiB, GPU 1516 MiB
I0323 11:48:40.573991 64 tensorrt.cc:1379] Created instance electric_bicycle_net_tao on GPU 0 with stream priority 0
I0323 11:48:40.574091 64 model_repository_manager.cc:1183] successfully loaded 'electric_bicycle_net_tao' version 1
I0323 11:48:40.574142 64 server.cc:522] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0323 11:48:40.574177 64 server.cc:549] 
+-------------+-----------------------------------------------------------------+--------+
| Backend     | Path                                                            | Config |
+-------------+-----------------------------------------------------------------+--------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so         | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {}     |
| openvino    | /opt/tritonserver/backends/openvino/libtriton_openvino.so       | {}     |
| tensorrt    | /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so       | {}     |
+-------------+-----------------------------------------------------------------+--------+

I0323 11:48:40.574200 64 server.cc:592] 
+--------------------------+---------+--------+
| Model                    | Version | Status |
+--------------------------+---------+--------+
| electric_bicycle_net_tao | 1       | READY  |
| vehicletypenet_tao       | 1       | READY  |
+--------------------------+---------+--------+
...
...
...

and the converted model is the one I used in the python test script.

Actually, the Triton server generates the TensorRT engine in a new docker; see

https://github.com/NVIDIA-AI-IOT/tao-toolkit-triton-apps/blob/main/scripts/config.sh#L26
https://github.com/NVIDIA-AI-IOT/tao-toolkit-triton-apps/blob/main/scripts/config.sh#L27
https://github.com/NVIDIA-AI-IOT/tao-toolkit-triton-apps/blob/main/scripts/start_server.sh#L59
https://github.com/NVIDIA-AI-IOT/tao-toolkit-triton-apps/blob/main/scripts/start_server.sh#L97

According to the above, can you check "$ docker images" and log in to that docker to generate the TensorRT engine?

My docker ps shows only one instance, and all of the above operations (running the Python script to start classification inference, and getting the errors) were done in it:

sudo docker ps
CONTAINER ID   IMAGE                                      COMMAND                  CREATED        STATUS        PORTS                                                           NAMES
9207ab950ed0   nvcr.io/nvidia/tao/triton-apps:21.11-py3   "/opt/tritonserver/n…"   16 hours ago   Up 16 hours   0.0.0.0:8000-8002->8000-8002/tcp, :::8000-8002->8000-8002/tcp   brave_sammet

My docker images:

REPOSITORY                       TAG             IMAGE ID       CREATED         SIZE
nvcr.io/nvidia/tao/triton-apps   21.11-py3       f33160171d35   12 days ago     13.8GB
nvcr.io/nvidia/tritonserver      22.02-py3       d52ac03519ab   4 weeks ago     12.3GB
nvcr.io/nvidia/tritonserver      22.02-py3-sdk   2ae0b4a9046e   4 weeks ago     11.5GB
nvcr.io/nvidia/tritonserver      21.12-py3       d22d6cbf54b7   3 months ago    12.2GB
nvcr.io/nvidia/tritonserver      21.12-py3-sdk   2846ae651342   3 months ago    11.2GB
nvcr.io/nvidia/tritonserver      21.10-py3       5c99e9b6586e   5 months ago    13.7GB
nvidia/cuda                      11.0-base       2ec708416bb8   19 months ago   122MB

I just ran tao-converter manually in the docker instance:

root@9207ab950ed0:/opt/tritonserver# tao-converter /tao_models/electric_bicycle_net_tao/final_model.etlt \
>               -k nvidia_tlt \
>               -d 3,224,224 \
>               -o predictions/Softmax \
>               -m 16 \
>               -e electric_bicycle_net_tao/new_generated_electric_bicycle_net_tao_model.plan

and still see the same error with:


trt_engine_path = os.path.join("electric_bicycle_net_tao/new_generated_electric_bicycle_net_tao_model.plan")
trt_engine = load_engine(trt_runtime, trt_engine_path)

OK, to narrow down, can you change the saving path ("-e") and filename and retry?

-e /opt/tritonserver/xxx.engine

Still the same error. Actually, there is a line of code to make sure the .engine file exists:

if not os.path.exists(trt_engine_path):
    print("the engine file does not exist, quit!")
    exit()

and this was never hit in the above experiments.

This is what I did this time to export the engine file under a new name, abc.engine:

tao-converter /tao_models/electric_bicycle_net_tao/final_model.etlt               -k nvidia_tlt               -d 3,224,224               -o predictions/Softmax               -m 16               -e /opt/tritonserver/abc.engine                     
[INFO] [MemUsageChange] Init CUDA: CPU +534, GPU +0, now: CPU 540, GPU 1827 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 629 MiB, GPU 1827 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +791, GPU +340, now: CPU 1464, GPU 2167 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +195, GPU +342, now: CPU 1659, GPU 2509 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 1 output network tensors.
[INFO] Total Host Persistent Memory: 94352
[INFO] Total Device Persistent Memory: 46283264
[INFO] Total Scratch Memory: 0
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 53 MiB, GPU 32 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2634, GPU 3035 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2634, GPU 3043 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2634, GPU 3027 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2634, GPU 3009 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 2634 MiB, GPU 3009 MiB
root@9207ab950ed0:/opt/tritonserver/mytest# mv ../abc.engine ./
root@9207ab950ed0:/opt/tritonserver/mytest# python3 infer_cls.py 
[03/24/2022-04:10:36] [TRT] [E] 1: [stdArchiveReader.cpp::StdArchiveReader::35] Error Code 1: Serialization (Serialization assertion safeVersionRead == safeSerializationVersion failed.Version tag does not match. Note: Current Version: 0, Serialized Engine Version: 43)
[03/24/2022-04:10:36] [TRT] [E] 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
Traceback (most recent call last):
  File "infer_cls.py", line 86, in <module>
    h_input, d_input, h_output, d_output, stream = allocate_buffers(trt_engine)
  File "infer_cls.py", line 34, in allocate_buffers
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
AttributeError: 'NoneType' object has no attribute 'get_binding_shape'
root@9207ab950ed0:/opt/tritonserver/mytest#

I installed the Python TensorRT package in the tao-toolkit-triton-apps docker instance via:
python3 -m pip install --upgrade nvidia-tensorrt
Is it possible that this causes the issue? I mean, do I need to specify the version number or something else in the command?

The TensorRT version should be the same when you generate the TRT engine and when you run inference with that engine.
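
A quick way to check, as a minimal sketch: print the version of the TensorRT Python module the inference script imports and compare it with the TensorRT version inside the triton-apps container that ran tao-converter. The pip nvidia-tensorrt wheel may pull a different TensorRT release than the libraries shipped in the container.

# Minimal sketch: show the TensorRT Python version used by the inference script.
# It must match the TensorRT version that built the engine (the tao-converter environment).
import tensorrt as trt

print("TensorRT Python module version:", trt.__version__)
# If the versions differ, deserialize_cuda_engine() returns None and TensorRT logs the
# "Serialization assertion safeVersionRead == safeSerializationVersion failed" error seen above.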

I think I have shown the steps, and the tao-converter run and the inference are all in the same docker instance. What else could be checked to find out why the Python script still shows the error?

To narrow down, can you let the Triton server generate the vehicletypenet model and check if you can load that model.plan with your code?

Just tried; still the same error, even when loading vehicletypenet_tao/1/model.plan.

But if I use the client to run inference with my model, it works (though with poor accuracy).

Hi Morgan, today I used another Ubuntu 20, x64, RTX 3090 machine to load the tao-toolkit-triton-apps docker again, with just a few changes to omit downloading and converting some of the models, since the downloading here costs a lot of time and sometimes gets stuck.
The modified repo is here; you can review those two commits.
This is the docker ps:

CONTAINER ID   IMAGE                                                     COMMAND                  CREATED          STATUS          PORTS                                                           NAMES
b2ba308f296a   nvcr.io/nvidia/tao/triton-apps:21.11-py3                  "/opt/tritonserver/n…"   36 minutes ago   Up 36 minutes   0.0.0

and this is the triton server console output:

...
...
...
+-------------+------------------------------------------------------+--------+

I0325 03:32:49.073422 57 server.cc:592] 
+--------------------+---------+--------+
| Model              | Version | Status |
+--------------------+---------+--------+
| vehicletypenet_tao | 1       | READY  |
+--------------------+---------+--------+

I0325 03:32:49.073464 57 tritonserver.cc:1920] 
+----------------------------------+------------------------------------------+
| Option                           | Value                                    |
+----------------------------------+------------------------------------------+
| server_id                        | triton                                   |
| server_version                   | 2.15.0                                   |
| server_extensions                | classification sequence model_repository |
|                                  |  model_repository(unload_dependents) sch |
|                                  | edule_policy model_configuration system_ |
|                                  | shared_memory cuda_shared_memory binary_ |
|                                  | tensor_data statistics                   |
| model_repository_path[0]         | /model_repository                        |
| model_control_mode               | MODE_NONE                                |
| strict_model_config              | 1                                        |
| rate_limit                       | OFF                                      |
| pinned_memory_pool_byte_size     | 268435456                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                 |
| response_cache_byte_size         | 0                                        |
| min_supported_compute_capability | 6.0                                      |
| strict_readiness                 | 1                                        |
| exit_timeout                     | 30                                       |
+----------------------------------+------------------------------------------+

I0325 03:32:49.074034 57 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001
I0325 03:32:49.074160 57 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000
I0325 03:32:49.115288 57 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002

After docker exec -it xxxx bash into the above docker instance, I installed these dependencies to start running the inference Python app:

# python3 -m pip install --upgrade setuptools pip
# python3 -m pip install nvidia-pyindex
# python3 -m pip install --upgrade nvidia-tensorrt
# pip3 install opencv-python
# apt-get update && apt-get install libgl1
# python3 -m pip install numpy
# python3 -m pip install 'pycuda<2021.1'
# pip3 install pillow

I was still trying to load the vehicletypenet_tao/1/model.plan, which was generated by the docker itself via the built-in tao-converter, but the same error still shows:

[03/25/2022-03:58:28] [TRT] [E] 1: [stdArchiveReader.cpp::StdArchiveReader::35] Error Code 1: Serialization (Serialization assertion safeVersionRead == safeSerializationVersion failed.Version tag does not match. Note: Current Version: 0, Serialized Engine Version: 43)
[03/25/2022-03:58:28] [TRT] [E] 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
Traceback (most recent call last):
  File "load_model_and_infer.py", line 98, in <module>
    h_input, d_input, h_output, d_output, stream = allocate_buffers(trt_engine)
  File "load_model_and_infer.py", line 47, in allocate_buffers
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
AttributeError: 'NoneType' object has no attribute 'get_binding_shape'