TAO-converted .plan model running in triton-server gives bad accuracy

Still the same error. Actually, there is one line of code to make sure the .engine file exists:

import os

if not os.path.exists(trt_engine_path):
    print("the engine file does not exist, quit!")
    exit()

and this check was never hit in the above experiments.

This time I exported the engine file under a new name, abc.engine:

tao-converter /tao_models/electric_bicycle_net_tao/final_model.etlt -k nvidia_tlt -d 3,224,224 -o predictions/Softmax -m 16 -e /opt/tritonserver/abc.engine
[INFO] [MemUsageChange] Init CUDA: CPU +534, GPU +0, now: CPU 540, GPU 1827 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 629 MiB, GPU 1827 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +791, GPU +340, now: CPU 1464, GPU 2167 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +195, GPU +342, now: CPU 1659, GPU 2509 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 1 output network tensors.
[INFO] Total Host Persistent Memory: 94352
[INFO] Total Device Persistent Memory: 46283264
[INFO] Total Scratch Memory: 0
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 53 MiB, GPU 32 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2634, GPU 3035 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2634, GPU 3043 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2634, GPU 3027 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2634, GPU 3009 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 2634 MiB, GPU 3009 MiB
root@9207ab950ed0:/opt/tritonserver/mytest# mv ../abc.engine ./
root@9207ab950ed0:/opt/tritonserver/mytest# python3 infer_cls.py 
[03/24/2022-04:10:36] [TRT] [E] 1: [stdArchiveReader.cpp::StdArchiveReader::35] Error Code 1: Serialization (Serialization assertion safeVersionRead == safeSerializationVersion failed.Version tag does not match. Note: Current Version: 0, Serialized Engine Version: 43)
[03/24/2022-04:10:36] [TRT] [E] 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
Traceback (most recent call last):
  File "infer_cls.py", line 86, in <module>
    h_input, d_input, h_output, d_output, stream = allocate_buffers(trt_engine)
  File "infer_cls.py", line 34, in allocate_buffers
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
AttributeError: 'NoneType' object has no attribute 'get_binding_shape'
root@9207ab950ed0:/opt/tritonserver/mytest#

I installed the Python tensorrt package in the tao-toolkit-triton-apps docker instance via:
python3 -m pip install --upgrade nvidia-tensorrt
Could that possibly cause the issue? I mean, do I need to specify a version number or something else in the command?

The tensorrt version should be the same when you generate the trt engine and when you run inference with it.
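You can confirm which TensorRT build your Python script actually imports, for example (a minimal check, run inside the same container that generated the engine):

import tensorrt as trt

# The pip-installed "nvidia-tensorrt" wheel can be a different release from the
# TensorRT that tao-converter / Triton in the container were built against; a
# mismatch here produces the serialization-version error shown above.
print("Python TensorRT version:", trt.__version__)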

I think I've shown the steps, and the tao-converter and the inferencing are all in the same docker instance. What else could I check to find out why the python script still shows the error?

To narrow down, can you let the triton server generate the vehicletypenet model and check if you can load that model.plan with your code?

Just tried; still the same error, even when loading vehicletypenet_tao/1/model.plan.

But if I use the client to infer with my model, it looks good (though with bad accuracy).

Hi Morgan, today I used another Ubuntu 20, x64, RTX 3090 machine to load the tao-toolkit-triton-apps docker again, but with small changes that omit some of the model downloading and converting, since the downloads here take a lot of time and sometimes get stuck.
The modified repo is here; you can review those 2 commits.
This is the docker ps output:

CONTAINER ID   IMAGE                                                     COMMAND                  CREATED          STATUS          PORTS                                                           NAMES
b2ba308f296a   nvcr.io/nvidia/tao/triton-apps:21.11-py3                  "/opt/tritonserver/n…"   36 minutes ago   Up 36 minutes   0.0.0

and this is the triton server console output:

...
...
...
+-------------+------------------------------------------------------+--------+

I0325 03:32:49.073422 57 server.cc:592] 
+--------------------+---------+--------+
| Model              | Version | Status |
+--------------------+---------+--------+
| vehicletypenet_tao | 1       | READY  |
+--------------------+---------+--------+

I0325 03:32:49.073464 57 tritonserver.cc:1920] 
+----------------------------------+------------------------------------------+
| Option                           | Value                                    |
+----------------------------------+------------------------------------------+
| server_id                        | triton                                   |
| server_version                   | 2.15.0                                   |
| server_extensions                | classification sequence model_repository |
|                                  |  model_repository(unload_dependents) sch |
|                                  | edule_policy model_configuration system_ |
|                                  | shared_memory cuda_shared_memory binary_ |
|                                  | tensor_data statistics                   |
| model_repository_path[0]         | /model_repository                        |
| model_control_mode               | MODE_NONE                                |
| strict_model_config              | 1                                        |
| rate_limit                       | OFF                                      |
| pinned_memory_pool_byte_size     | 268435456                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                 |
| response_cache_byte_size         | 0                                        |
| min_supported_compute_capability | 6.0                                      |
| strict_readiness                 | 1                                        |
| exit_timeout                     | 30                                       |
+----------------------------------+------------------------------------------+

I0325 03:32:49.074034 57 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001
I0325 03:32:49.074160 57 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000
I0325 03:32:49.115288 57 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002

After docker exec -it xxxx bash into the above docker instance, I installed these dependencies to start running the inference python app:

# python3 -m pip install --upgrade setuptools pip
# python3 -m pip install nvidia-pyindex
# python3 -m pip install --upgrade nvidia-tensorrt
# pip3 install opencv-python
# apt-get update && apt-get install libgl1
# python3 -m pip install numpy
# python3 -m pip install 'pycuda<2021.1'
# pip3 install pillow

I was still trying to load the vehicletypenet_tao/1/model.plan which was generated by the docker itself via the built-in tao-converter, but the same error still shows:

[03/25/2022-03:58:28] [TRT] [E] 1: [stdArchiveReader.cpp::StdArchiveReader::35] Error Code 1: Serialization (Serialization assertion safeVersionRead == safeSerializationVersion failed.Version tag does not match. Note: Current Version: 0, Serialized Engine Version: 43)
[03/25/2022-03:58:28] [TRT] [E] 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
Traceback (most recent call last):
  File "load_model_and_infer.py", line 98, in <module>
    h_input, d_input, h_output, d_output, stream = allocate_buffers(trt_engine)
  File "load_model_and_infer.py", line 47, in allocate_buffers
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
AttributeError: 'NoneType' object has no attribute 'get_binding_shape'

Is there anything I can do further to narrow down the low accuracy issue at triton-server?

On your side, please run the "tao-converter" command to generate the tensorrt engine again after the above command. That means you will not let the docker itself build model.plan.

$ tao-converter xxx

Just tried in the docker instance; ran the command:

tao-converter vehicletypenet_model/resnet18_vehicletypenet_pruned.etlt -k tlt_encode -c vehicletypenet_model/vehicletypenet_int8.txt -d 3,224,224 -o predictions/Softmax -t int8 -m 16 -e vehicletypenet_tao/1/latest_in_place_generated_model.plan

I can see the generation finished:


...
...
[WARNING] Missing scale and zero-point for tensor block_4b_bn_shortcut/Reshape_2/shape, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor block_4b_bn_shortcut/moving_mean, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor block_4b_bn_shortcut/Reshape/shape, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor predictions/kernel, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor predictions/bias, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +791, GPU +340, now: CPU 1389, GPU 2805 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +196, GPU +342, now: CPU 1585, GPU 3147 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Detected 1 inputs and 1 output network tensors.
[INFO] Total Host Persistent Memory: 87408
[INFO] Total Device Persistent Memory: 5741568
[INFO] Total Scratch Memory: 0
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 25 MiB, GPU 4 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2566, GPU 3633 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2566, GPU 3641 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2565, GPU 3625 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2565, GPU 3607 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 2546 MiB, GPU 3607 MiB

Then I tried loading this newly generated latest_in_place_generated_model.plan in the test python script; still the same error.

Is it possible the python script is not correct?

import tensorrt as trt

def load_engine(trt_runtime, engine_path):
    # Read the serialized engine bytes from disk.
    with open(engine_path, "rb") as f:
        engine_data = f.read()
    # deserialize_cuda_engine returns None on failure
    # (e.g. the engine was built with a different TensorRT version).
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)
trt_engine = load_engine(trt_runtime, trt_engine_path)
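For reference, the allocate_buffers referenced in the traceback typically follows the standard TensorRT/pycuda sample pattern; a minimal sketch, assuming one float32 input binding and one float32 output binding as in the lines quoted above:

import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

def allocate_buffers(engine):
    # If the engine failed to deserialize (i.e. it is None), the next line raises
    # the AttributeError seen in the traceback above.
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)),
                                    dtype=trt.nptype(trt.float32))
    d_input = cuda.mem_alloc(h_input.nbytes)
    h_output = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)),
                                     dtype=trt.nptype(trt.float32))
    d_output = cuda.mem_alloc(h_output.nbytes)
    stream = cuda.Stream()  # stream for async host/device copies and execution
    return h_input, d_input, h_output, d_output, stream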

Please run the below experiment in a new terminal on your host.
$ docker run --runtime=nvidia -it --rm -v yourfolder:/workspace nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 /bin/bash

The above will log you into the tao docker.
Then generate the trt engine and run inference.

# tao-converter final_model.etlt -k nvidia_tlt -o predictions/Softmax -d 3,224,224 -i nchw -m 64 -e sample_3.0.engine -b 64
# python infer_script.py

There is no problem loading the engine on my side.

Thanks Morgan,
The infer_script.py finally works in the suggested docker nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 (of course the model.plan was converted inside it).
The testing data are the same 240 electric-bicycle images, which were actually copied from part of the model training dataset; the results comparison is listed below:

  • By running infer_script.py inside the tao-toolkit-tf docker:
    only 68 are correctly recognized as electric-bicycle; the rest, 172, are incorrectly recognized as bicycle.

  • By using triton-server (tao-toolkit-triton-apps):
    using image_client.py to call the triton service like:

    python3 image_client.py -m ele_two_vehicle_net_tao ~/Pictures/data/train/electric_bicycle/

    from the console output:
    only 64 are correctly recognized as electric-bicycle; the rest, 176, are incorrectly recognized as bicycle.

  • By using the TAO jupyter notebook on my training machine,
    the command is:

    !tao classification inference -e $SPECS_DIR/classification_retrain_spec.cfg \
                              -m $USER_EXPERIMENT_DIR/output_retrain/weights/resnet_$EPOCH.tlt \
                              -k $KEY -b 32 -d $DATA_DOWNLOAD_DIR/split/compare_test/electric_bicycle \
                              -cm $USER_EXPERIMENT_DIR/output_retrain/classmap.json
    

    by checking the result.csv, 239 are correctly recognized as electric-bicycle; only 1 is incorrectly recognized as bicycle.

Could you help explain why the accuracy is so different?

For your infer_script.py, please modify according to Inferring resnet18 classification etlt model with python - #40 by Morganh

Add below.

from keras.applications.imagenet_utils import preprocess_input

And change

return np.asarray(image.resize((w, h), Image.ANTIALIAS)).transpose([2, 0, 1]).astype(trt.nptype(trt.float32)).ravel()

to

return preprocess_input(np.asarray(image.resize((w, h), Image.ANTIALIAS)).transpose([2, 0, 1]).astype(trt.nptype(trt.float32)), mode='caffe', data_format='channels_first').ravel()
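For reference, mode='caffe' in Keras' preprocess_input swaps RGB to BGR and subtracts the ImageNet channel means, with no scaling; a rough NumPy equivalent for a channels-first array (a sketch, the helper name is just for illustration):

import numpy as np

def caffe_preprocess_chw(image_chw):
    # image_chw: float32 array of shape (3, H, W) in RGB channel order.
    bgr = image_chw[::-1, :, :]                   # RGB -> BGR channel swap
    mean = np.array([103.939, 116.779, 123.68],   # ImageNet channel means (BGR order)
                    dtype=np.float32).reshape(3, 1, 1)
    return bgr - mean                             # mean subtraction only, no scaling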

Yes, Morgan, now infer_script.py shows almost the same accuracy as in the training stage!
Could you help explain the behavior here a bit, and what should I do to make triton-server achieve this accuracy?

The preprocessing should be the culprit in triton inference.

Is it controlled by a parameter in triton? Could you show me? Thanks.

There is no parameter. For the preprocessing of classification, see tao-toolkit-triton-apps/tao_client.py at main · NVIDIA-AI-IOT/tao-toolkit-triton-apps · GitHub

Thanks.
As you point out at tao-toolkit-triton-apps/tao_client.py at main · NVIDIA-AI-IOT/tao-toolkit-triton-apps · GitHub, --mode with classification will trigger the code in the else branch:

else:
    img = frame.load_image()
    repeated_image_data.append(
        triton_model.preprocess(
            frame.as_numpy(img)
        )
    )

And I looked at triton_model.preprocess(self, image):

def preprocess(self, image):
    """Function to preprocess image
    Performs mean subtraction and then normalization.
    Args:
        image (np.ndarray): Numpy ndarray of an input batch.
    Returns:
        image (np.ndarray): Preprocessed input image.
    """
    image = (image - self.mean) * self.scale
    return image

which doesn't seem to handle the image the way you suggested earlier:

return preprocess_input(np.asarray(image.resize((w, h), Image.ANTIALIAS)).transpose([2, 0, 1]).astype(trt.nptype(trt.float32)), mode='caffe', data_format='channels_first').ravel()

Does this mean the Classification sample in tao_client.py also has this issue?

Please modify tao-toolkit-triton-apps/tao_client.py at ae6b5ec41c3a9651957c4dddfc262a43f47e263c · NVIDIA-AI-IOT/tao-toolkit-triton-apps · GitHub

to below and retry.

elif FLAGS.mode.lower() == "multitask_classification" or FLAGS.mode.lower() == "classification":
