Inferencing with DINO on Triton Inference Server

Description

I am trying to set up Triton Inference Server with a DINO engine file, and I am looking for any example references for running the DINO model on Triton Inference Server. Thanks.

I tried to run the engine following what I found in the docs and ended up with the following error.

Environment

GPU Type: Tesla T4
Nvidia Driver Version: 535.171.04
CUDA Version: 12.2
Operating System + Version: Ubuntu 22.04.4 LTS
Python Version (if applicable): 3.10.12
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): tritonserver:24.07-py3

Relevant Files

model repo structure
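For reference, a TensorRT model in a Triton model repository follows the standard layout below (the directory name must match the model name, and the numeric version subdirectory holds the serialized engine; config.pbtxt is optional here since the logs show auto-complete-config is enabled):

model_repository/
└── dino/
    ├── config.pbtxt        (optional with auto-complete-config)
    └── 1/
        └── model.plan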

Steps To Reproduce

Docker run command
sudo docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:24.07-py3 tritonserver --model-repository=/models
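In a working setup, server readiness can then be verified over Triton's standard HTTP health endpoint, e.g.:

curl -v localhost:8000/v2/health/ready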

Error thrown:

Complete Error Logs

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 24.07 (build 101814614)
Triton Server Version 2.48.0

Copyright (c) 2018-2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.5 driver version 555.42.06 with kernel driver version 535.171.04.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

I0804 14:46:20.658159 1 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x70b29c000000' with size 268435456"
I0804 14:46:20.661299 1 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0804 14:46:20.666066 1 model_lifecycle.cc:472] "loading: dino:1"
I0804 14:46:20.684498 1 tensorrt.cc:65] "TRITONBACKEND_Initialize: tensorrt"
I0804 14:46:20.684531 1 tensorrt.cc:75] "Triton TRITONBACKEND API version: 1.19"
I0804 14:46:20.684537 1 tensorrt.cc:81] "'tensorrt' TRITONBACKEND API version: 1.19"
I0804 14:46:20.684542 1 tensorrt.cc:105] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I0804 14:46:20.687355 1 tensorrt.cc:231] "TRITONBACKEND_ModelInitialize: dino (version 1)"
I0804 14:46:20.816988 1 logging.cc:46] "Loaded engine size: 238 MiB"
E0804 14:46:20.859823 1 logging.cc:40] "IRuntime::deserializeCudaEngine: Error Code 1: Serialization (Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed.Version tag does not match. Note: Current Version: 238, Serialized Engine Version: 236)"
I0804 14:46:20.863841 1 tensorrt.cc:274] "TRITONBACKEND_ModelFinalize: delete model state"
E0804 14:46:20.863867 1 model_lifecycle.cc:641] "failed to load 'dino' version 1: Internal: unable to load plan file to auto complete config: /models/dino/1/model.plan"
I0804 14:46:20.863878 1 model_lifecycle.cc:776] "failed to load 'dino'"
I0804 14:46:20.863945 1 server.cc:604] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0804 14:46:20.864014 1 server.cc:631] 
+----------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend  | Path                                                      | Config                                                                                                                                                        |
+----------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorrt | /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+----------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0804 14:46:20.864075 1 server.cc:674] 
+-------+---------+----------------------------------------------------------------------------------------------------+
| Model | Version | Status                                                                                             |
+-------+---------+----------------------------------------------------------------------------------------------------+
| dino  | 1       | UNAVAILABLE: Internal: unable to load plan file to auto complete config: /models/dino/1/model.plan |
+-------+---------+----------------------------------------------------------------------------------------------------+

I0804 14:46:20.896006 1 metrics.cc:877] "Collecting metrics for GPU 0: Tesla T4"
I0804 14:46:20.902783 1 metrics.cc:770] "Collecting CPU metrics"
I0804 14:46:20.902958 1 tritonserver.cc:2598] 
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.48.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /models                                                                                                                                                                                                         |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 0                                                                                                                                                                                                               |
| model_config_name                |                                                                                                                                                                                                                 |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0804 14:46:20.903017 1 server.cc:305] "Waiting for in-flight requests to complete."
I0804 14:46:20.903023 1 server.cc:321] "Timeout 30: Found 0 model versions that have in-flight inferences"
I0804 14:46:20.903064 1 server.cc:336] "All models are stopped, unloading models"
I0804 14:46:20.903095 1 server.cc:345] "Timeout 30: Found 0 live models and 0 in-flight non-inference requests"
error: creating server: Internal - failed to load all models

Hi @sanujen.20,
Apologies for the delay, but I would recommend raising this issue at Issues · triton-inference-server/server · GitHub.
Thanks
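For anyone hitting the same error: the key log line is the serialization assertion, "stdVersionRead == kSERIALIZATION_VERSION failed ... Current Version: 238, Serialized Engine Version: 236". TensorRT engine plans are not portable across TensorRT versions, so this indicates the model.plan was built with an older TensorRT release than the one bundled in tritonserver:24.07-py3. The usual fix is to rebuild the engine with the matching TensorRT version. A minimal sketch, assuming the model is available as an ONNX export (dino.onnx is a hypothetical filename) and using the TensorRT container from the same 24.07 release so the serialization versions line up:

# Rebuild the engine inside the 24.07 TensorRT container (assumption: dino.onnx
# is the DINO ONNX export sitting in the current directory)
sudo docker run --gpus all --rm -v ${PWD}:/workspace nvcr.io/nvidia/tensorrt:24.07-py3 \
  trtexec --onnx=/workspace/dino.onnx --saveEngine=/workspace/model.plan

The regenerated model.plan then replaces the file under /models/dino/1/ in the model repository. Note that the engine should also be built on the same GPU architecture it will be served on (a Tesla T4 here), since TensorRT plans are device-specific as well.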