TAO-converted .plan model running in triton-server gives poor accuracy

In the triton server, the TRT engine is already generated for you. You can find your model (model.plan); that is the TensorRT engine.

I ran docker exec -it xxx bash into the docker instance of tao-toolkit-triton-apps. After installing a bunch of dependencies (nvidia-tensorrt, opencv-python, libgl1, pycuda, pillow), I ran the python script infer_cls.py with small modifications from here; the script is now:

import os
import time
import cv2
#import matplotlib.pyplot as plt
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
from PIL import Image

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def load_engine(trt_runtime, engine_path):
    with open(engine_path, "rb") as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

def allocate_buffers(engine):
    # Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
    h_output = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(trt.float32))
    # Allocate device memory for inputs and outputs.
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()
    return h_input, d_input, h_output, d_output, stream

def load_normalized_test_case(test_image, pagelocked_buffer):
    # Converts the input image to a CHW Numpy array
    def normalize_image(image):
        # Resize, antialias and transpose the image to CHW.
        c, h, w = 3,120,120
        return np.asarray(image.resize((w, h), Image.ANTIALIAS)).transpose([2, 0, 1]).astype(trt.nptype(trt.float32)).ravel()

    # Normalize the image and copy to pagelocked memory.
    np.copyto(pagelocked_buffer, normalize_image(Image.open(test_image)))
    return test_image

def do_inference(context, h_input, d_input, h_output, d_output, stream):
    # Transfer input data to the GPU.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    # Run inference.
    context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    # Synchronize the stream
    stream.synchronize()
    return h_output,h_input

if __name__ == '__main__':    
    neg = 0
    pos = 0
    count = 0
    
    # TensorRT logger singleton
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    trt_engine_path = os.path.join("electric_bicycle_net_tao/1/model.plan")
    if not os.path.exists(trt_engine_path):
        print("the engine file does not exist, quit!")
        exit()
    trt_runtime = trt.Runtime(TRT_LOGGER)
    trt_engine = load_engine(trt_runtime, trt_engine_path)
    
    # This allocates memory for network inputs/outputs on both CPU and GPU
    h_input, d_input, h_output, d_output, stream = allocate_buffers(trt_engine)
    
    # Execution context is needed for inference
    context = trt_engine.create_execution_context()


    # -------------- MODEL PARAMETERS FOR THE MODEL --------------------------------
    model_h = 120
    model_w = 120
    img_dir = "data/"
    
    folders = os.listdir(img_dir)
    
    for sub_folder in folders:
        #loop over the folders
        images = os.listdir(img_dir + sub_folder)
        
        for i in images:
            #loop over the images
            
            test_image = img_dir + sub_folder + "/" + i
            
            labels_file = "electric_bicycle_net_tao/labels.txt"
            labels = open(labels_file, 'r').read().split('\n')
            
            test_case = load_normalized_test_case(test_image, h_input)
            
            start_time = time.time()
            h_output,h_input = do_inference(context, h_input, d_input, h_output, d_output, stream)
            pred = labels[np.argmax(h_output)]
            
            #print (test_image)
            print ("class: ",pred,", Confidence: ", max(h_output))
            print ("Inference Time : ",time.time()-start_time)
            
            if pred == "negative":
                neg +=1
            if pred == "positive":
                pos+=1
                print (test_image)
                #time.sleep(3)
                
            #time.sleep(1)
            count += 1
    
    print ("Total Number of items in the directory : ",count)
    print ("Total number of Positive Items : ",pos)
    print ("Total number of Negative Items : ",neg)

it shows an error:

root@9207ab950ed0:/opt/tritonserver/mytest# python3 infer_cls.py
[03/24/2022-03:17:41] [TRT] [E] 1: [stdArchiveReader.cpp::StdArchiveReader::35] Error Code 1: Serialization (Serialization assertion safeVersionRead == safeSerializationVersion failed.Version tag does not match. Note: Current Version: 0, Serialized Engine Version: 43)
[03/24/2022-03:17:41] [TRT] [E] 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
Traceback (most recent call last):
File "infer_cls.py", line 85, in <module>
h_input, d_input, h_output, d_output, stream = allocate_buffers(trt_engine)
File "infer_cls.py", line 34, in allocate_buffers
h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
AttributeError: 'NoneType' object has no attribute 'get_binding_shape'

this is my folder structure:

root@9207ab950ed0:/opt/tritonserver/mytest# ls
data  electric_bicycle_net_tao  infer_cls.py
root@9207ab950ed0:/opt/tritonserver/mytest# ls electric_bicycle_net_tao/1/     
model.plan

could you help?

That means the existing TensorRT engine does not match your environment.
You can refer to the command below to generate the TensorRT engine again.

tao-converter /tao_models/vehicletypenet_model/resnet18_vehicletypenet_pruned.etlt \
              -k tlt_encode \
              -c /tao_models/vehicletypenet_model/vehicletypenet_int8.txt \
              -d 3,224,224 \
              -o predictions/Softmax \
              -t int8 \
              -m 16 \
              -e /model_repository/vehicletypenet_tao/1/model.plan

I think the above model was actually generated from the tao-toolkit-triton-apps repo.
I cloned tao-toolkit-triton-apps and modified tao-toolkit-triton-apps/download_and_convert.sh a bit:

echo "Converting the Electric_bicycle_net_tao model"
mkdir -p /model_repository/electric_bicycle_net_tao/1
tao-converter /tao_models/electric_bicycle_net_tao/final_model.etlt \
              -k nvidia_tlt \
              -d 3,224,224 \
              -o predictions/Softmax \
              -m 16 \
              -e /model_repository/electric_bicycle_net_tao/1/model.plan

At the startup of the triton-server, I can see the model was correctly converted:


...
...
...
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2565, GPU 1702 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 2546 MiB, GPU 1702 MiB
Converting the Electric_bicycle_net_tao model
[INFO] [MemUsageChange] Init CUDA: CPU +534, GPU +0, now: CPU 540, GPU 560 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 629 MiB, GPU 560 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +791, GPU +340, now: CPU 1464, GPU 900 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +195, GPU +342, now: CPU 1659, GPU 1242 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 1 output network tensors.
[INFO] Total Host Persistent Memory: 94864
[INFO] Total Device Persistent Memory: 46283264
[INFO] Total Scratch Memory: 0
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 53 MiB, GPU 32 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2634, GPU 1768 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2634, GPU 1776 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2634, GPU 1760 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2634, GPU 1742 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 2634 MiB, GPU 1742 MiB
I0323 11:48:38.459937 64 metrics.cc:298] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3060
I0323 11:48:38.625626 64 libtorch.cc:1092] TRITONBACKEND_Initialize: pytorch
I0323 11:48:38.625643 64 libtorch.cc:1102] Triton TRITONBACKEND API version: 1.6
I0323 11:48:38.625646 64 libtorch.cc:1108] 'pytorch' TRITONBACKEND API version: 1.6
2022-03-23 11:48:38.818313: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0323 11:48:38.844581 64 tensorflow.cc:2170] TRITONBACKEND_Initialize: tensorflow
I0323 11:48:38.844603 64 tensorflow.cc:2180] Triton TRITONBACKEND API version: 1.6
I0323 11:48:38.844606 64 tensorflow.cc:2186] 'tensorflow' TRITONBACKEND API version: 1.6
I0323 11:48:38.844609 64 tensorflow.cc:2210] backend configuration:
{}
I0323 11:48:38.845561 64 onnxruntime.cc:1999] TRITONBACKEND_Initialize: onnxruntime
I0323 11:48:38.845572 64 onnxruntime.cc:2009] Triton TRITONBACKEND API version: 1.6
I0323 11:48:38.845575 64 onnxruntime.cc:2015] 'onnxruntime' TRITONBACKEND API version: 1.6
I0323 11:48:38.876770 64 openvino.cc:1193] TRITONBACKEND_Initialize: openvino
I0323 11:48:38.876789 64 openvino.cc:1203] Triton TRITONBACKEND API version: 1.6
I0323 11:48:38.876792 64 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.6
I0323 11:48:39.002462 64 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fd9e6000000' with size 268435456
I0323 11:48:39.002599 64 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0323 11:48:39.003531 64 model_repository_manager.cc:1022] loading: vehicletypenet_tao:1
I0323 11:48:39.103978 64 model_repository_manager.cc:1022] loading: electric_bicycle_net_tao:1
I0323 11:48:39.133371 64 tensorrt.cc:4925] TRITONBACKEND_Initialize: tensorrt
I0323 11:48:39.133394 64 tensorrt.cc:4935] Triton TRITONBACKEND API version: 1.6
I0323 11:48:39.133398 64 tensorrt.cc:4941] 'tensorrt' TRITONBACKEND API version: 1.6
I0323 11:48:39.133477 64 tensorrt.cc:4984] backend configuration:
{}
I0323 11:48:39.133680 64 tensorrt.cc:5036] TRITONBACKEND_ModelInitialize: vehicletypenet_tao (version 1)
I0323 11:48:39.135022 64 tensorrt.cc:5085] TRITONBACKEND_ModelInstanceInitialize: vehicletypenet_tao (GPU device 0)
I0323 11:48:39.509504 64 logging.cc:49] [MemUsageChange] Init CUDA: CPU +525, GPU +0, now: CPU 648, GPU 624 (MiB)
I0323 11:48:39.513867 64 logging.cc:49] Loaded engine size: 5 MB
I0323 11:48:39.513959 64 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 659 MiB, GPU 624 MiB
I0323 11:48:40.022610 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +792, GPU +340, now: CPU 1451, GPU 970 (MiB)
I0323 11:48:40.441824 64 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +195, GPU +336, now: CPU 1646, GPU 1306 (MiB)
I0323 11:48:40.442715 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1646, GPU 1288 (MiB)
I0323 11:48:40.442756 64 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1646 MiB, GPU 1288 MiB
I0323 11:48:40.442897 64 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1635 MiB, GPU 1288 MiB
I0323 11:48:40.443158 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1635, GPU 1298 (MiB)
I0323 11:48:40.443830 64 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1635, GPU 1306 (MiB)
I0323 11:48:40.444192 64 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation end: CPU 1635 MiB, GPU 1326 MiB
I0323 11:48:40.444274 64 tensorrt.cc:1379] Created instance vehicletypenet_tao on GPU 0 with stream priority 0
I0323 11:48:40.444292 64 tensorrt.cc:5036] TRITONBACKEND_ModelInitialize: electric_bicycle_net_tao (version 1)
I0323 11:48:40.444393 64 model_repository_manager.cc:1183] successfully loaded 'vehicletypenet_tao' version 1
I0323 11:48:40.445171 64 tensorrt.cc:5085] TRITONBACKEND_ModelInstanceInitialize: electric_bicycle_net_tao (GPU device 0)
I0323 11:48:40.445384 64 logging.cc:49] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1635, GPU 1326 (MiB)
I0323 11:48:40.475368 64 logging.cc:49] Loaded engine size: 44 MB
I0323 11:48:40.475466 64 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 1724 MiB, GPU 1326 MiB
I0323 11:48:40.569055 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1724, GPU 1386 (MiB)
I0323 11:48:40.569464 64 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1724, GPU 1396 (MiB)
I0323 11:48:40.569926 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1724, GPU 1380 (MiB)
I0323 11:48:40.569967 64 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1724 MiB, GPU 1380 MiB
I0323 11:48:40.572716 64 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1635 MiB, GPU 1380 MiB
I0323 11:48:40.572975 64 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 1636, GPU 1388 (MiB)
I0323 11:48:40.573242 64 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1636, GPU 1396 (MiB)
I0323 11:48:40.573912 64 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation end: CPU 1636 MiB, GPU 1516 MiB
I0323 11:48:40.573991 64 tensorrt.cc:1379] Created instance electric_bicycle_net_tao on GPU 0 with stream priority 0
I0323 11:48:40.574091 64 model_repository_manager.cc:1183] successfully loaded 'electric_bicycle_net_tao' version 1
I0323 11:48:40.574142 64 server.cc:522] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0323 11:48:40.574177 64 server.cc:549] 
+-------------+-----------------------------------------------------------------+--------+
| Backend     | Path                                                            | Config |
+-------------+-----------------------------------------------------------------+--------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so         | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {}     |
| openvino    | /opt/tritonserver/backends/openvino/libtriton_openvino.so       | {}     |
| tensorrt    | /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so       | {}     |
+-------------+-----------------------------------------------------------------+--------+

I0323 11:48:40.574200 64 server.cc:592] 
+--------------------------+---------+--------+
| Model                    | Version | Status |
+--------------------------+---------+--------+
| electric_bicycle_net_tao | 1       | READY  |
| vehicletypenet_tao       | 1       | READY  |
+--------------------------+---------+--------+
...
...
...

and the converted model is the one I used in the python test script.

Actually, the triton server generates the TensorRT engine in a new docker container, see:

https://github.com/NVIDIA-AI-IOT/tao-toolkit-triton-apps/blob/main/scripts/config.sh#L26
https://github.com/NVIDIA-AI-IOT/tao-toolkit-triton-apps/blob/main/scripts/config.sh#L27
https://github.com/NVIDIA-AI-IOT/tao-toolkit-triton-apps/blob/main/scripts/start_server.sh#L59
https://github.com/NVIDIA-AI-IOT/tao-toolkit-triton-apps/blob/main/scripts/start_server.sh#L97

According to the above, can you check "$ docker images" and log into that docker to generate the TensorRT engine?

My docker ps shows only 1 instance, and all of the above operations (running the python script to start the classification inference, which produced the errors) were done inside it:

sudo docker ps
CONTAINER ID   IMAGE                                      COMMAND                  CREATED        STATUS        PORTS                                                           NAMES
9207ab950ed0   nvcr.io/nvidia/tao/triton-apps:21.11-py3   "/opt/tritonserver/n…"   16 hours ago   Up 16 hours   0.0.0.0:8000-8002->8000-8002/tcp, :::8000-8002->8000-8002/tcp   brave_sammet

My docker images:

REPOSITORY                       TAG             IMAGE ID       CREATED         SIZE
nvcr.io/nvidia/tao/triton-apps   21.11-py3       f33160171d35   12 days ago     13.8GB
nvcr.io/nvidia/tritonserver      22.02-py3       d52ac03519ab   4 weeks ago     12.3GB
nvcr.io/nvidia/tritonserver      22.02-py3-sdk   2ae0b4a9046e   4 weeks ago     11.5GB
nvcr.io/nvidia/tritonserver      21.12-py3       d22d6cbf54b7   3 months ago    12.2GB
nvcr.io/nvidia/tritonserver      21.12-py3-sdk   2846ae651342   3 months ago    11.2GB
nvcr.io/nvidia/tritonserver      21.10-py3       5c99e9b6586e   5 months ago    13.7GB
nvidia/cuda                      11.0-base       2ec708416bb8   19 months ago   122MB

I just ran tao-converter manually in the docker instance:

root@9207ab950ed0:/opt/tritonserver# tao-converter /tao_models/electric_bicycle_net_tao/final_model.etlt \
>               -k nvidia_tlt \
>               -d 3,224,224 \
>               -o predictions/Softmax \
>               -m 16 \
>               -e electric_bicycle_net_tao/new_generated_electric_bicycle_net_tao_model.plan

and I still see the same error with:


trt_engine_path = os.path.join("electric_bicycle_net_tao/new_generated_electric_bicycle_net_tao_model.plan")
trt_engine = load_engine(trt_runtime, trt_engine_path)

OK, to narrow it down, can you change the saving path ("-e") and filename and retry?

-e /opt/tritonserver/xxx.engine

Still the same error. Actually, there is a check in the code to make sure the .engine file exists:

    if not os.path.exists(trt_engine_path):
        print("the engine file does not exist, quit!")
        exit()

and this was never hit in the above experiments.

This is what I did this time to export the engine file under a new name, abc.engine:

tao-converter /tao_models/electric_bicycle_net_tao/final_model.etlt -k nvidia_tlt -d 3,224,224 -o predictions/Softmax -m 16 -e /opt/tritonserver/abc.engine
[INFO] [MemUsageChange] Init CUDA: CPU +534, GPU +0, now: CPU 540, GPU 1827 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 629 MiB, GPU 1827 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +791, GPU +340, now: CPU 1464, GPU 2167 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +195, GPU +342, now: CPU 1659, GPU 2509 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 1 output network tensors.
[INFO] Total Host Persistent Memory: 94352
[INFO] Total Device Persistent Memory: 46283264
[INFO] Total Scratch Memory: 0
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 53 MiB, GPU 32 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2634, GPU 3035 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2634, GPU 3043 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2634, GPU 3027 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2634, GPU 3009 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 2634 MiB, GPU 3009 MiB
root@9207ab950ed0:/opt/tritonserver/mytest# mv ../abc.engine ./
root@9207ab950ed0:/opt/tritonserver/mytest# python3 infer_cls.py 
[03/24/2022-04:10:36] [TRT] [E] 1: [stdArchiveReader.cpp::StdArchiveReader::35] Error Code 1: Serialization (Serialization assertion safeVersionRead == safeSerializationVersion failed.Version tag does not match. Note: Current Version: 0, Serialized Engine Version: 43)
[03/24/2022-04:10:36] [TRT] [E] 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
Traceback (most recent call last):
  File "infer_cls.py", line 86, in <module>
    h_input, d_input, h_output, d_output, stream = allocate_buffers(trt_engine)
  File "infer_cls.py", line 34, in allocate_buffers
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
AttributeError: 'NoneType' object has no attribute 'get_binding_shape'
root@9207ab950ed0:/opt/tritonserver/mytest#

I installed the python tensorrt package in the tao-toolkit-triton-apps docker instance via:
python3 -m pip install --upgrade nvidia-tensorrt
Is it possible that this caused the issue? I mean, do I need to specify a version number or something else in the command?

The TensorRT version should be the same when you generate the TRT engine and when you run inference with that TRT engine.
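
(As a quick sanity check, a minimal sketch like the one below, assuming the pip-installed tensorrt package, prints the version the Python runtime actually uses; it can then be compared against the TensorRT version shown by tao-converter and in the Triton startup logs inside the container.)

# hypothetical check, not part of the original script
import tensorrt as trt

# The pip-installed nvidia-tensorrt wheel can be a different release than the
# TensorRT that tao-converter links against inside this container, which is
# exactly the "Version tag does not match" situation in the error above.
print("Python TensorRT version:", trt.__version__)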

I think I have shown the steps: the tao-converter run and the inference are all in the same docker instance. What else could be checked to see why the python script still shows the error?

To narrow it down, can you let the triton server generate the vehicletypenet model and check if you can load that model.plan with your code?

Just tried; still the same error, even when loading vehicletypenet_tao/1/model.plan.

But if I use the client to infer with my model, it looks good (though with bad accuracy).

Hi Morgan, today I used another Ubuntu 20, x64, RTX 3090 machine to load the tao-toolkit-triton-apps docker again, but with a few changes that omit downloading and converting some of the models, as the downloading here costs a lot of time and sometimes gets stuck.
The modified repo is from here; you can review those 2 commits.
This is the docker ps:

CONTAINER ID   IMAGE                                                     COMMAND                  CREATED          STATUS          PORTS                                                           NAMES
b2ba308f296a   nvcr.io/nvidia/tao/triton-apps:21.11-py3                  "/opt/tritonserver/n…"   36 minutes ago   Up 36 minutes   0.0.0

and this is the triton server console output:

...
...
...
+-------------+------------------------------------------------------+--------+

I0325 03:32:49.073422 57 server.cc:592] 
+--------------------+---------+--------+
| Model              | Version | Status |
+--------------------+---------+--------+
| vehicletypenet_tao | 1       | READY  |
+--------------------+---------+--------+

I0325 03:32:49.073464 57 tritonserver.cc:1920] 
+----------------------------------+------------------------------------------+
| Option                           | Value                                    |
+----------------------------------+------------------------------------------+
| server_id                        | triton                                   |
| server_version                   | 2.15.0                                   |
| server_extensions                | classification sequence model_repository |
|                                  |  model_repository(unload_dependents) sch |
|                                  | edule_policy model_configuration system_ |
|                                  | shared_memory cuda_shared_memory binary_ |
|                                  | tensor_data statistics                   |
| model_repository_path[0]         | /model_repository                        |
| model_control_mode               | MODE_NONE                                |
| strict_model_config              | 1                                        |
| rate_limit                       | OFF                                      |
| pinned_memory_pool_byte_size     | 268435456                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                 |
| response_cache_byte_size         | 0                                        |
| min_supported_compute_capability | 6.0                                      |
| strict_readiness                 | 1                                        |
| exit_timeout                     | 30                                       |
+----------------------------------+------------------------------------------+

I0325 03:32:49.074034 57 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001
I0325 03:32:49.074160 57 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000
I0325 03:32:49.115288 57 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002

After docker exec -it xxxx bash into the above docker instance, I installed these dependencies to start running the inference python app:

# python3 -m pip install --upgrade setuptools pip
# python3 -m pip install nvidia-pyindex
# python3 -m pip install --upgrade nvidia-tensorrt
# pip3 install opencv-python
# apt-get update && apt-get install libgl1
# python3 -m pip install numpy
# python3 -m pip install 'pycuda<2021.1'
# pip3 install pillow

I was still trying to load the vehicletypenet_tao/1/model.plan which was generated by the docker itself via the built-in tao-converter, but the same error still shows:

[03/25/2022-03:58:28] [TRT] [E] 1: [stdArchiveReader.cpp::StdArchiveReader::35] Error Code 1: Serialization (Serialization assertion safeVersionRead == safeSerializationVersion failed.Version tag does not match. Note: Current Version: 0, Serialized Engine Version: 43)
[03/25/2022-03:58:28] [TRT] [E] 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
Traceback (most recent call last):
  File "load_model_and_infer.py", line 98, in <module>
    h_input, d_input, h_output, d_output, stream = allocate_buffers(trt_engine)
  File "load_model_and_infer.py", line 47, in allocate_buffers
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
AttributeError: 'NoneType' object has no attribute 'get_binding_shape'

Is there anything I can do further to narrow down the low-accuracy issue at the triton-server?

On your side, please run the "tao-converter" command to generate the TensorRT engine again after the above command. That means you will not let the docker itself build model.plan.

$ tao-converter xxx

Just tried. In the docker instance, I ran the command:

tao-converter vehicletypenet_model/resnet18_vehicletypenet_pruned.etlt -k tlt_encode -c vehicletypenet_model/vehicletypenet_int8.txt -d 3,224,224 -o predictions/Softmax -t int8 -m 16 -e vehicletypenet_tao/1/latest_in_place_generated_model.plan

and I can see the generation finished:


...
...
tensor
[WARNING] Missing scale and zero-point for tensor block_4b_bn_shortcut/Reshape_2/shape, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor block_4b_bn_shortcut/moving_mean, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor block_4b_bn_shortcut/Reshape/shape, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor predictions/kernel, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor predictions/bias, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +791, GPU +340, now: CPU 1389, GPU 2805 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +196, GPU +342, now: CPU 1585, GPU 3147 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Detected 1 inputs and 1 output network tensors.
[INFO] Total Host Persistent Memory: 87408
[INFO] Total Device Persistent Memory: 5741568
[INFO] Total Scratch Memory: 0
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 25 MiB, GPU 4 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2566, GPU 3633 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2566, GPU 3641 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2565, GPU 3625 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2565, GPU 3607 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 2546 MiB, GPU 3607 MiB

Then I tried loading this newly generated latest_in_place_generated_model.plan in the test python script; still the same error.

Is it possible that the python script is not correct?

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)
trt_engine = load_engine(trt_runtime, trt_engine_path)

def load_engine(trt_runtime, engine_path):
    with open(engine_path, "rb") as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine
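
(The snippet itself looks standard. As a side note, a more defensive version, just a sketch using the same TensorRT API, would check the return value of deserialize_cuda_engine, which is None when deserialization fails, so the script stops at the real cause instead of crashing later in allocate_buffers:)

def load_engine(trt_runtime, engine_path):
    with open(engine_path, "rb") as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    if engine is None:
        # deserialize_cuda_engine returns None on failure, e.g. when the plan
        # was built with a different TensorRT version than the one installed here
        raise RuntimeError("Failed to deserialize engine: " + engine_path)
    return engine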

Please run the experiment below in a new terminal on your host.
$ docker run --runtime=nvidia -it --rm -v yourfolder:/workspace nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 /bin/bash

The above will log you into the TAO docker.
Then generate the TRT engine and run inference.

# tao-converter final_model.etlt -k nvidia_tlt -o predictions/Softmax -d 3,224,224 -i nchw -m 64 -e sample_3.0.engine -b 64
# python infer_script.py

There is no problem loading the engine on my side.

Thanks Morgan,
The infer_script.py finally works in the suggested docker nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 (with the model.plan, of course, converted inside it).
The test data are the same 240 electric-bicycle images, which were actually copied from part of the model training dataset. The results comparison is listed below:

  • By running infer_script.py inside the tao-toolkit-tf docker,
    only 68 are correctly recognized as electric-bicycle; the other 172 are incorrectly recognized as bicycle.

  • By using the triton-server (tao-toolkit-triton-apps),
    using image_client.py to call the triton service like:

    python3 image_client.py -m ele_two_vehicle_net_tao ~/Pictures/data/train/electric_bicycle/
    

    from the console output result,
    only 64 are correctly recognized as electric-bicycle; the other 176 are incorrectly recognized as bicycle.

  • By using the TAO Jupyter notebook on my training machine,
    the command is:

    !tao classification inference -e $SPECS_DIR/classification_retrain_spec.cfg \
                              -m $USER_EXPERIMENT_DIR/output_retrain/weights/resnet_$EPOCH.tlt \
                              -k $KEY -b 32 -d $DATA_DOWNLOAD_DIR/split/compare_test/electric_bicycle \
                              -cm $USER_EXPERIMENT_DIR/output_retrain/classmap.json
    

    by checking the result.csv, 239 are correctly recognized as electric-bicycle; only 1 is incorrectly recognized as bicycle.

Could you help me understand why the accuracy is so different?

For your infer_script.py, please modify it according to Inferring resnet18 classification etlt model with python - #40 by Morganh

Add the following:

from keras.applications.imagenet_utils import preprocess_input

And change

return np.asarray(image.resize((w, h), Image.ANTIALIAS)).transpose([2, 0, 1]).astype(trt.nptype(trt.float32)).ravel()

to

return preprocess_input(np.asarray(image.resize((w, h), Image.ANTIALIAS)).transpose([2, 0, 1]).astype(trt.nptype(trt.float32)), mode='caffe', data_format='channels_first').ravel()
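
(For reference, a rough NumPy-only equivalent of what preprocess_input(..., mode='caffe', data_format='channels_first') does, based on my understanding of the keras helper rather than an official TAO reference, is to flip RGB to BGR and subtract the ImageNet channel means, with no extra scaling:)

import numpy as np

def caffe_preprocess_chw(rgb_chw):
    # rgb_chw: float32 array of shape (3, H, W) in RGB channel order
    bgr = rgb_chw[::-1, :, :].astype(np.float32)           # RGB -> BGR
    mean = np.array([103.939, 116.779, 123.68], dtype=np.float32)
    bgr -= mean[:, None, None]                              # per-channel mean subtraction
    return bgr.ravel()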