TAO-converted .plan model running in triton-server gives poor accuracy

In the triton server, the TRT engine is already generated for you. You can find your model (model.plan); that is the TensorRT engine.

I ran docker exec -it xxx bash into the docker instance of tao-toolkit-triton-apps. After installing a bunch of dependencies (nvidia-tensorrt, opencv-python, libgl1, pycuda, pillow), I ran the python script infer_cls.py with small modifications from here; the script is now:

import os
import time
import cv2
#import matplotlib.pyplot as plt
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
from PIL import Image

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def load_engine(trt_runtime, engine_path):
    with open(engine_path, "rb") as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

def allocate_buffers(engine):
    # Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
    h_output = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(trt.float32))
    # Allocate device memory for inputs and outputs.
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()
    return h_input, d_input, h_output, d_output, stream

def load_normalized_test_case(test_image, pagelocked_buffer):
    # Converts the input image to a CHW Numpy array
    def normalize_image(image):
        # Resize, antialias and transpose the image to CHW.
        c, h, w = 3,120,120
        return np.asarray(image.resize((w, h), Image.ANTIALIAS)).transpose([2, 0, 1]).astype(trt.nptype(trt.float32)).ravel()

    # Normalize the image and copy to pagelocked memory.
    np.copyto(pagelocked_buffer, normalize_image(Image.open(test_image)))
    return test_image

def do_inference(context, h_input, d_input, h_output, d_output, stream):
    # Transfer input data to the GPU.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    # Run inference.
    context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    # Synchronize the stream
    stream.synchronize()
    return h_output,h_input

if __name__ == '__main__':    
    neg = 0
    pos = 0
    count = 0
    
    # TensorRT logger singleton
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    trt_engine_path = os.path.join("electric_bicycle_net_tao/1/model.plan")
    if not os.path.exists(trt_engine_path):
        print("the engine file does not exist, quit!")
        exit()
    trt_runtime = trt.Runtime(TRT_LOGGER)
    trt_engine = load_engine(trt_runtime, trt_engine_path)
    
    # This allocates memory for network inputs/outputs on both CPU and GPU
    h_input, d_input, h_output, d_output, stream = allocate_buffers(trt_engine)
    
    # Execution context is needed for inference
    context = trt_engine.create_execution_context()


    # -------------- MODEL PARAMETERS FOR THE MODEL --------------------------------
    model_h = 120
    model_w = 120
    img_dir = "data/"
    
    folders = os.listdir(img_dir)
    
    for sub_folder in folders:
        #loop over the folders
        images = os.listdir(img_dir + sub_folder)
        
        for i in images:
            #loop over the images
            
            test_image = img_dir + sub_folder + "/" + i
            
            labels_file = "electric_bicycle_net_tao/labels.txt"
            labels = open(labels_file, 'r').read().split('\n')
            
            test_case = load_normalized_test_case(test_image, h_input)
            
            start_time = time.time()
            h_output,h_input = do_inference(context, h_input, d_input, h_output, d_output, stream)
            pred = labels[np.argmax(h_output)]
            
            #print (test_image)
            print ("class: ",pred,", Confidence: ", max(h_output))
            print ("Inference Time : ",time.time()-start_time)
            
            if pred == "negative":
                neg +=1
            if pred == "positive":
                pos+=1
                print (test_image)
                #time.sleep(3)
                
            #time.sleep(1)
            count += 1
    
    print ("Total Number of items in the directory : ",count)
    print ("Total number of Positive Items : ",pos)
    print ("Total number of Negative Items : ",neg)

it shows an error:

root@9207ab950ed0:/opt/tritonserver/mytest# python3 infer_cls.py
[03/24/2022-03:17:41] [TRT] [E] 1: [stdArchiveReader.cpp::StdArchiveReader::35] Error Code 1: Serialization (Serialization assertion safeVersionRead == safeSerializationVersion failed.Version tag does not match. Note: Current Version: 0, Serialized Engine Version: 43)
[03/24/2022-03:17:41] [TRT] [E] 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
Traceback (most recent call last):
File "infer_cls.py", line 85, in <module>
h_input, d_input, h_output, d_output, stream = allocate_buffers(trt_engine)
File "infer_cls.py", line 34, in allocate_buffers
h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
AttributeError: 'NoneType' object has no attribute 'get_binding_shape'

this is my folder structure:

root@9207ab950ed0:/opt/tritonserver/mytest# ls
data  electric_bicycle_net_tao  infer_cls.py
root@9207ab950ed0:/opt/tritonserver/mytest# ls electric_bicycle_net_tao/1/     
model.plan

could you help?

That means the existing TensorRT engine does not match your environment.
You can refer to the command below to generate the TensorRT engine again.

tao-converter /tao_models/vehicletypenet_model/resnet18_vehicletypenet_pruned.etlt \
              -k tlt_encode \
              -c /tao_models/vehicletypenet_model/vehicletypenet_int8.txt \
              -d 3,224,224 \
              -o predictions/Softmax \
              -t int8 \
              -m 16 \
              -e /model_repository/vehicletypenet_tao/1/model.plan

I think the above model was actually generated from the tao-toolkit-triton-apps repo.
I cloned tao-toolkit-triton-apps and modified tao-toolkit-triton-apps/download_and_convert.sh a bit:

echo "Converting the Electric_bicycle_net_tao model"
mkdir -p /model_repository/electric_bicycle_net_tao/1
tao-converter /tao_models/electric_bicycle_net_tao/final_model.etlt \
              -k nvidia_tlt \
              -d 3,224,224 \
              -o predictions/Softmax \
              -m 16 \
              -e /model_repository/electric_bicycle_net_tao/1/model.plan

At the startup of the triton-server, I can see the model was correctly converted:


...
...
...
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2565, GPU 1702 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 2546 MiB, GPU 1702 MiB
Converting the Electric_bicycle_net_tao model
[INFO] [MemUsageChange] Init CUDA: CPU +534, GPU +0, now: CPU 540, GPU 560 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 629 MiB, GPU 560 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +791, GPU +340, now: CPU 1464, GPU 900 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +195, GPU +342, now: CPU 1659, GPU 1242 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 1 output network tensors.
[INFO] Total Host Persistent Memory: 94864
[INFO] Total Device Persistent Memory: 46283264
[INFO] Total Scratch Memory: 0
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 53 MiB, GPU 32 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2634, GPU 1768 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2634, GPU 1776 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2634, GPU 1760 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2634, GPU 1742 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 2634 MiB, GPU 1742 MiB
I0323 11:48:38.459937 64 metrics.cc:298] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3060
I0323 11:48:38.625626 64 libtorch.cc:1092] TRITONBACKEND_Initialize: pytorch
I0323 11:48:38.625643 64 libtorch.cc:1102] Triton TRITONBACKEND API version: 1.6
I0323 11:48:38.625646 64 libtorch.cc:1108] 'pytorch' TRITONBACKEND API version: 1.6
2022-03-23 11:48:38.818313: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0323 11:48:38.844581 64 tensorflow.cc:2170] TRITONBACKEND_Initialize: tensorflow
I0323 11:48:38.844603 64 tensorflow.cc:2180] Triton TRITONBACKEND API version: 1.6
I0323 11:48:38.844606 64 tensorflow.cc:2186] 'tensorflow' TRITONBACKEND API version: 1.6
I0323 11:48:38.844609 64 tensorflow.cc:2210] backend configuration:
{}
I0323 11:48:38.845561 64 onnxruntime.cc:1999] TRITONBACKEND_Initialize: onnxruntime
I0323 11:48:38.845572 64 onnxruntime.cc:2009] Triton TRITONBACKEND API version: 1.6
I0323 11:48:38.845575 64 onnxruntime.cc:2015] 'onnxruntime' TRITONBACKEND API version: 1.6
I0323 11:48:38.876770 64 openvino.cc:1193] TRITONBACKEND_Initialize: openvino
I0323 11:48:38.876789 64 openvino.cc:1203] Triton TRITONBACKEND API version: 1.6
I0323 11:48:38.876792 64 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.6
I0323 11:48:39.002462 64 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fd9e6000000' with size 268435456
I0323 11:48:39.002599 64 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0323 11:48:39.003531 64 model_repository_manager.cc:1022] loading: vehicletypenet_tao:1
I0323 11:48:39.103978 64 model_repository_manager.cc:1022] loading: electric_bicycle_net_tao:1
I0323 11:48:39.133371 64 tensorrt.cc:4925] TRITONBACKEND_Initialize: tensorrt
I0323 11:48:39.133394 64 tensorrt.cc:4935] Triton TRITONBACKEND API version: 1.6
I0323 11:48:39.133398 64 tensorrt.cc:4941] 'tensorrt' TRITONBACKEND API version: 1.6
I0323 11:48:39.133477 64 tensorrt.cc:4984] backend configuration:
{}
I0323 11:48:39.133680 64 tensorrt.cc:5036] TRITONBACKEND_ModelInitialize: vehicletypenet_tao (version 1)
I0323 11:48:39.135022 64 tensorrt.cc:5085] TRITONBACKEND_ModelInstanceInitialize: vehicletypenet_tao (GPU device 0)
I0323 11:48:39.509504 64 logging.cc:49] [MemUsageChange] Init CUDA: CPU +525, GPU +0, now: CPU 648, GPU 624 (MiB)
I0323 11:48:39.513867 64 logging.cc:49] Loaded engine size: 5 MB
I0323 11:48:39.513959 64 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 659 MiB, GPU 624 MiB
I0323 11:48:40.022610 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +792, GPU +340, now: CPU 1451, GPU 970 (MiB)
I0323 11:48:40.441824 64 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +195, GPU +336, now: CPU 1646, GPU 1306 (MiB)
I0323 11:48:40.442715 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1646, GPU 1288 (MiB)
I0323 11:48:40.442756 64 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1646 MiB, GPU 1288 MiB
I0323 11:48:40.442897 64 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1635 MiB, GPU 1288 MiB
I0323 11:48:40.443158 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1635, GPU 1298 (MiB)
I0323 11:48:40.443830 64 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1635, GPU 1306 (MiB)
I0323 11:48:40.444192 64 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation end: CPU 1635 MiB, GPU 1326 MiB
I0323 11:48:40.444274 64 tensorrt.cc:1379] Created instance vehicletypenet_tao on GPU 0 with stream priority 0
I0323 11:48:40.444292 64 tensorrt.cc:5036] TRITONBACKEND_ModelInitialize: electric_bicycle_net_tao (version 1)
I0323 11:48:40.444393 64 model_repository_manager.cc:1183] successfully loaded 'vehicletypenet_tao' version 1
I0323 11:48:40.445171 64 tensorrt.cc:5085] TRITONBACKEND_ModelInstanceInitialize: electric_bicycle_net_tao (GPU device 0)
I0323 11:48:40.445384 64 logging.cc:49] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1635, GPU 1326 (MiB)
I0323 11:48:40.475368 64 logging.cc:49] Loaded engine size: 44 MB
I0323 11:48:40.475466 64 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 1724 MiB, GPU 1326 MiB
I0323 11:48:40.569055 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1724, GPU 1386 (MiB)
I0323 11:48:40.569464 64 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1724, GPU 1396 (MiB)
I0323 11:48:40.569926 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1724, GPU 1380 (MiB)
I0323 11:48:40.569967 64 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1724 MiB, GPU 1380 MiB
I0323 11:48:40.572716 64 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1635 MiB, GPU 1380 MiB
I0323 11:48:40.572975 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 1636, GPU 1388 (MiB)
I0323 11:48:40.573242 64 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1636, GPU 1396 (MiB)
I0323 11:48:40.573912 64 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation end: CPU 1636 MiB, GPU 1516 MiB
I0323 11:48:40.573991 64 tensorrt.cc:1379] Created instance electric_bicycle_net_tao on GPU 0 with stream priority 0
I0323 11:48:40.574091 64 model_repository_manager.cc:1183] successfully loaded 'electric_bicycle_net_tao' version 1
I0323 11:48:40.574142 64 server.cc:522] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0323 11:48:40.574177 64 server.cc:549] 
+-------------+-----------------------------------------------------------------+--------+
| Backend     | Path                                                            | Config |
+-------------+-----------------------------------------------------------------+--------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so         | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {}     |
| openvino    | /opt/tritonserver/backends/openvino/libtriton_openvino.so       | {}     |
| tensorrt    | /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so       | {}     |
+-------------+-----------------------------------------------------------------+--------+

I0323 11:48:40.574200 64 server.cc:592] 
+--------------------------+---------+--------+
| Model                    | Version | Status |
+--------------------------+---------+--------+
| electric_bicycle_net_tao | 1       | READY  |
| vehicletypenet_tao       | 1       | READY  |
+--------------------------+---------+--------+
...
...
...

and the converted model is the one I used in the python test script.

Actually, the triton server generates the TensorRT engine in a new docker container, see:

https://github.com/NVIDIA-AI-IOT/tao-toolkit-triton-apps/blob/main/scripts/config.sh#L26
https://github.com/NVIDIA-AI-IOT/tao-toolkit-triton-apps/blob/main/scripts/config.sh#L27
https://github.com/NVIDIA-AI-IOT/tao-toolkit-triton-apps/blob/main/scripts/start_server.sh#L59
https://github.com/NVIDIA-AI-IOT/tao-toolkit-triton-apps/blob/main/scripts/start_server.sh#L97

According to the above, can you check "$ docker images" and log into that docker to generate the TensorRT engine?

My docker ps shows only 1 instance, and all of the above operations (running the python script to start the classification inference, which produced the errors) were done inside it:

sudo docker ps
CONTAINER ID   IMAGE                                      COMMAND                  CREATED        STATUS        PORTS                                                           NAMES
9207ab950ed0   nvcr.io/nvidia/tao/triton-apps:21.11-py3   "/opt/tritonserver/n…"   16 hours ago   Up 16 hours   0.0.0.0:8000-8002->8000-8002/tcp, :::8000-8002->8000-8002/tcp   brave_sammet

My docker images:

REPOSITORY                       TAG             IMAGE ID       CREATED         SIZE
nvcr.io/nvidia/tao/triton-apps   21.11-py3       f33160171d35   12 days ago     13.8GB
nvcr.io/nvidia/tritonserver      22.02-py3       d52ac03519ab   4 weeks ago     12.3GB
nvcr.io/nvidia/tritonserver      22.02-py3-sdk   2ae0b4a9046e   4 weeks ago     11.5GB
nvcr.io/nvidia/tritonserver      21.12-py3       d22d6cbf54b7   3 months ago    12.2GB
nvcr.io/nvidia/tritonserver      21.12-py3-sdk   2846ae651342   3 months ago    11.2GB
nvcr.io/nvidia/tritonserver      21.10-py3       5c99e9b6586e   5 months ago    13.7GB
nvidia/cuda                      11.0-base       2ec708416bb8   19 months ago   122MB

I just ran tao-converter manually in the docker instance:

root@9207ab950ed0:/opt/tritonserver# tao-converter /tao_models/electric_bicycle_net_tao/final_model.etlt \
>               -k nvidia_tlt \
>               -d 3,224,224 \
>               -o predictions/Softmax \
>               -m 16 \
>               -e electric_bicycle_net_tao/new_generated_electric_bicycle_net_tao_model.plan

and I still see the same error with:


trt_engine_path = os.path.join("electric_bicycle_net_tao/new_generated_electric_bicycle_net_tao_model.plan")
trt_engine = load_engine(trt_runtime, trt_engine_path)

OK, to narrow it down, can you change the saving path ("-e") and filename and retry?

-e /opt/tritonserver/xxx.engine

Still the same error. Actually, there is a check in the code to make sure the .engine file exists:

    if not os.path.exists(trt_engine_path):
        print("the engine file does not exist, quit!")
        exit()

and this was never hit in the above experiments.

This is what I did this time to export the engine file under a new name, abc.engine:

tao-converter /tao_models/electric_bicycle_net_tao/final_model.etlt -k nvidia_tlt -d 3,224,224 -o predictions/Softmax -m 16 -e /opt/tritonserver/abc.engine
[INFO] [MemUsageChange] Init CUDA: CPU +534, GPU +0, now: CPU 540, GPU 1827 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 629 MiB, GPU 1827 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +791, GPU +340, now: CPU 1464, GPU 2167 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +195, GPU +342, now: CPU 1659, GPU 2509 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 1 output network tensors.
[INFO] Total Host Persistent Memory: 94352
[INFO] Total Device Persistent Memory: 46283264
[INFO] Total Scratch Memory: 0
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 53 MiB, GPU 32 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2634, GPU 3035 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2634, GPU 3043 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2634, GPU 3027 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2634, GPU 3009 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 2634 MiB, GPU 3009 MiB
root@9207ab950ed0:/opt/tritonserver/mytest# mv ../abc.engine ./
root@9207ab950ed0:/opt/tritonserver/mytest# python3 infer_cls.py 
[03/24/2022-04:10:36] [TRT] [E] 1: [stdArchiveReader.cpp::StdArchiveReader::35] Error Code 1: Serialization (Serialization assertion safeVersionRead == safeSerializationVersion failed.Version tag does not match. Note: Current Version: 0, Serialized Engine Version: 43)
[03/24/2022-04:10:36] [TRT] [E] 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
Traceback (most recent call last):
  File "infer_cls.py", line 86, in <module>
    h_input, d_input, h_output, d_output, stream = allocate_buffers(trt_engine)
  File "infer_cls.py", line 34, in allocate_buffers
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
AttributeError: 'NoneType' object has no attribute 'get_binding_shape'
root@9207ab950ed0:/opt/tritonserver/mytest#

I installed the python tensorrt package in the tao-toolkit-triton-apps docker instance via:
python3 -m pip install --upgrade nvidia-tensorrt
Is it possible that this caused the issue? I mean, do I need to specify a version number or something else in the command?

The TensorRT version should be the same when you generate the TRT engine and when you run inference with that TRT engine.
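
(As a quick sanity check, a minimal sketch like the one below, assuming the pip-installed tensorrt package, prints the version the Python runtime actually uses; it can then be compared against the TensorRT version shown by tao-converter and in the Triton startup logs inside the container.)

# hypothetical check, not part of the original script
import tensorrt as trt

# The pip-installed nvidia-tensorrt wheel can be a different release than the
# TensorRT that tao-converter links against inside this container, which is
# exactly the "Version tag does not match" situation in the error above.
print("Python TensorRT version:", trt.__version__)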

I think I have shown the steps: the tao-converter run and the inference are all in the same docker instance. What else could be checked to see why the python script still shows the error?

To narrow it down, can you let the triton server generate the vehicletypenet model and check if you can load that model.plan with your code?

Just tried; still the same error, even when loading vehicletypenet_tao/1/model.plan.

But if I use the client to infer with my model, it looks good (though with bad accuracy).

Hi Morgan, today I used another Ubuntu 20, x64, RTX 3090 machine to load the tao-toolkit-triton-apps docker again, but with a few changes that omit downloading and converting some of the models, as the downloading here costs a lot of time and sometimes gets stuck.
The modified repo is from here; you can review those 2 commits.
This is the docker ps:

CONTAINER ID   IMAGE                                                     COMMAND                  CREATED          STATUS          PORTS                                                           NAMES
b2ba308f296a   nvcr.io/nvidia/tao/triton-apps:21.11-py3                  "/opt/tritonserver/n…"   36 minutes ago   Up 36 minutes   0.0.0

and this is the triton server console output:

...
...
...
+-------------+------------------------------------------------------+--------+

I0325 03:32:49.073422 57 server.cc:592] 
+--------------------+---------+--------+
| Model              | Version | Status |
+--------------------+---------+--------+
| vehicletypenet_tao | 1       | READY  |
+--------------------+---------+--------+

I0325 03:32:49.073464 57 tritonserver.cc:1920] 
+----------------------------------+------------------------------------------+
| Option                           | Value                                    |
+----------------------------------+------------------------------------------+
| server_id                        | triton                                   |
| server_version                   | 2.15.0                                   |
| server_extensions                | classification sequence model_repository |
|                                  |  model_repository(unload_dependents) sch |
|                                  | edule_policy model_configuration system_ |
|                                  | shared_memory cuda_shared_memory binary_ |
|                                  | tensor_data statistics                   |
| model_repository_path[0]         | /model_repository                        |
| model_control_mode               | MODE_NONE                                |
| strict_model_config              | 1                                        |
| rate_limit                       | OFF                                      |
| pinned_memory_pool_byte_size     | 268435456                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                 |
| response_cache_byte_size         | 0                                        |
| min_supported_compute_capability | 6.0                                      |
| strict_readiness                 | 1                                        |
| exit_timeout                     | 30                                       |
+----------------------------------+------------------------------------------+

I0325 03:32:49.074034 57 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001
I0325 03:32:49.074160 57 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000
I0325 03:32:49.115288 57 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002

After docker exec -it xxxx bash into the above docker instance, I installed these dependencies to start running the inference python app:

# python3 -m pip install --upgrade setuptools pip
# python3 -m pip install nvidia-pyindex
# python3 -m pip install --upgrade nvidia-tensorrt
# pip3 install opencv-python
# apt-get update && apt-get install libgl1
# python3 -m pip install numpy
# python3 -m pip install 'pycuda<2021.1'
# pip3 install pillow

I was still trying to load the vehicletypenet_tao/1/model.plan which was generated by the docker itself via the built-in tao-converter, but the same error still shows:

[03/25/2022-03:58:28] [TRT] [E] 1: [stdArchiveReader.cpp::StdArchiveReader::35] Error Code 1: Serialization (Serialization assertion safeVersionRead == safeSerializationVersion failed.Version tag does not match. Note: Current Version: 0, Serialized Engine Version: 43)
[03/25/2022-03:58:28] [TRT] [E] 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
Traceback (most recent call last):
  File "load_model_and_infer.py", line 98, in <module>
    h_input, d_input, h_output, d_output, stream = allocate_buffers(trt_engine)
  File "load_model_and_infer.py", line 47, in allocate_buffers
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
AttributeError: 'NoneType' object has no attribute 'get_binding_shape'

Is there anything I can do further to narrow down the low-accuracy issue at the triton-server?

On your side, please run the "tao-converter" command to generate the TensorRT engine again after the above command. That means you will not let the docker itself build model.plan.

$ tao-converter xxx

Just tried. In the docker instance, I ran the command:

tao-converter vehicletypenet_model/resnet18_vehicletypenet_pruned.etlt -k tlt_encode -c vehicletypenet_model/vehicletypenet_int8.txt -d 3,224,224 -o predictions/Softmax -t int8 -m 16 -e vehicletypenet_tao/1/latest_in_place_generated_model.plan

and I can see the generation finished:


...
...
tensor
[WARNING] Missing scale and zero-point for tensor block_4b_bn_shortcut/Reshape_2/shape, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor block_4b_bn_shortcut/moving_mean, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor block_4b_bn_shortcut/Reshape/shape, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor predictions/kernel, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor predictions/bias, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +791, GPU +340, now: CPU 1389, GPU 2805 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +196, GPU +342, now: CPU 1585, GPU 3147 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Detected 1 inputs and 1 output network tensors.
[INFO] Total Host Persistent Memory: 87408
[INFO] Total Device Persistent Memory: 5741568
[INFO] Total Scratch Memory: 0
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 25 MiB, GPU 4 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2566, GPU 3633 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2566, GPU 3641 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2565, GPU 3625 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2565, GPU 3607 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 2546 MiB, GPU 3607 MiB

Then I tried loading this newly generated latest_in_place_generated_model.plan in the test python script; still the same error.

Is it possible that the python script is not correct?

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)
trt_engine = load_engine(trt_runtime, trt_engine_path)

def load_engine(trt_runtime, engine_path):
    with open(engine_path, "rb") as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine
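
(The snippet itself looks standard. As a side note, a more defensive version, just a sketch using the same TensorRT API, would check the return value of deserialize_cuda_engine, which is None when deserialization fails, so the script stops at the real cause instead of crashing later in allocate_buffers:)

def load_engine(trt_runtime, engine_path):
    with open(engine_path, "rb") as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    if engine is None:
        # deserialize_cuda_engine returns None on failure, e.g. when the plan
        # was built with a different TensorRT version than the one installed here
        raise RuntimeError("Failed to deserialize engine: " + engine_path)
    return engine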

Please run the experiment below in a new terminal on your host.
$ docker run --runtime=nvidia -it --rm -v yourfolder:/workspace nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 /bin/bash

The above will log you into the TAO docker.
Then generate the TRT engine and run inference.

# tao-converter final_model.etlt -k nvidia_tlt -o predictions/Softmax -d 3,224,224 -i nchw -m 64 -e sample_3.0.engine -b 64
# python infer_script.py

There is no problem loading the engine on my side.

Thanks Morgan,
The infer_script.py finally works in the suggested docker nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 (with the model.plan, of course, converted inside it).
The test data are the same 240 electric-bicycle images, which were actually copied from part of the model training dataset. The results comparison is listed below:

  • By running infer_script.py inside the tao-toolkit-tf docker,
    only 68 are correctly recognized as electric-bicycle; the other 172 are incorrectly recognized as bicycle.

  • By using the triton-server (tao-toolkit-triton-apps),
    using image_client.py to call the triton service like:

    python3 image_client.py -m ele_two_vehicle_net_tao ~/Pictures/data/train/electric_bicycle/
    

    from the console output result,
    only 64 are correctly recognized as electric-bicycle; the other 176 are incorrectly recognized as bicycle.

  • By using the TAO Jupyter notebook on my training machine,
    the command is:

    !tao classification inference -e $SPECS_DIR/classification_retrain_spec.cfg \
                              -m $USER_EXPERIMENT_DIR/output_retrain/weights/resnet_$EPOCH.tlt \
                              -k $KEY -b 32 -d $DATA_DOWNLOAD_DIR/split/compare_test/electric_bicycle \
                              -cm $USER_EXPERIMENT_DIR/output_retrain/classmap.json
    

    by checking the result.csv, 239 are correctly recognized as electric-bicycle; only 1 is incorrectly recognized as bicycle.

Could you help me understand why the accuracy is so different?

For your infer_script.py, please modify it according to Inferring resnet18 classification etlt model with python - #40 by Morganh

Add the following:

from keras.applications.imagenet_utils import preprocess_input

And change

return np.asarray(image.resize((w, h), Image.ANTIALIAS)).transpose([2, 0, 1]).astype(trt.nptype(trt.float32)).ravel()

to

return preprocess_input(np.asarray(image.resize((w, h), Image.ANTIALIAS)).transpose([2, 0, 1]).astype(trt.nptype(trt.float32)), mode='caffe', data_format='channels_first').ravel()
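
(For reference, a rough NumPy-only equivalent of what preprocess_input(..., mode='caffe', data_format='channels_first') does, based on my understanding of the keras helper rather than an official TAO reference, is to flip RGB to BGR and subtract the ImageNet channel means, with no extra scaling:)

import numpy as np

def caffe_preprocess_chw(rgb_chw):
    # rgb_chw: float32 array of shape (3, H, W) in RGB channel order
    bgr = rgb_chw[::-1, :, :].astype(np.float32)           # RGB -> BGR
    mean = np.array([103.939, 116.779, 123.68], dtype=np.float32)
    bgr -= mean[:, None, None]                              # per-channel mean subtraction
    return bgr.ravel()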