The TensorRT Inference Server can't run a pretrained model after converting it from ONNX to Caffe2

I converted my pretrained model from ONNX to Caffe2 following these instructions: https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/. In the "ONNX Models" section there is a link to this repo, https://github.com/pytorch/pytorch/tree/master/caffe2/python/onnx, as the standard tool for the conversion.

I tried converting following the instructions at https://github.com/onnx/tutorials/blob/master/tutorials/Caffe2OnnxExport.ipynb:

convert-onnx-to-caffe2 assets/squeezenet.onnx --output predict_net.pb --init-net-output init_net.pb

and this

convert-onnx-to-caffe2 assets/squeezenet.onnx --output predict_net.netdef --init-net-output init_net.netdef
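
For reference, the same conversion can also be done from Python. This is a minimal sketch, assuming the caffe2 Python package (from the PyTorch repo linked above) and onnx are installed; the paths are the ones from the commands above:

import onnx
from caffe2.python.onnx.backend import Caffe2Backend

# Load the ONNX model and convert it into a pair of Caffe2 NetDefs.
onnx_model = onnx.load("assets/squeezenet.onnx")
init_net, predict_net = Caffe2Backend.onnx_graph_to_caffe2_net(onnx_model)

# Serialize both nets; these are the files placed in the model version directory.
with open("init_net.pb", "wb") as f:
    f.write(init_net.SerializeToString())
with open("predict_net.pb", "wb") as f:
    f.write(predict_net.SerializeToString())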

I created the directory hierarchy as described at https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/:

/tmp/models/test_pb/
  config.pbtxt
  1/
    predict_net.pb
    init_net.pb

My config.pbtxt for the example above:

name: "test_pb"
platform: "tensorflow_graphdef"
max_batch_size: 128
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [ 224, 224, 3 ]
  }
]
output [
  {
    name: "InceptionV3/Predictions/Softmax"
    data_type: TYPE_FP32
    dims: [ 1001 ]
  }
]
instance_group [
  {
    kind: KIND_GPU,
    count: 4
  }
]

and

/tmp/models/test_netdef/
  config.pbtxt
  1/
    predict_net.netdef
    init_net.netdef

My config.pbtxt for the example above:

name: "test_netdef"
platform: "tensorflow_graphdef"
max_batch_size: 128
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [ 224, 224, 3 ]
  }
]
output [
  {
    name: "InceptionV3/Predictions/Softmax"
    data_type: TYPE_FP32
    dims: [ 1001 ]
  }
]
instance_group [
  {
    kind: KIND_GPU,
    count: 4
  }
]

But when I run the container

nvidia-docker run --rm -p8000:8000 -p8001:8001 -v/tmp/models:/models nvcr.io/nvidia/tensorrtserver:18.09-py3 trtserver --model-store=/models

and send a curl status request, I get

ready_state: MODEL_UNAVAILABLE
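
For completeness, the request was along these lines (a sketch; I am assuming the server's HTTP status endpoint here, with the model name from the config above):

curl localhost:8000/api/status/test_pb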

What am I doing incorrectly?

y.glushenkov@ml-test-env:/tmp/models_example$ nvidia-smi
Fri Jan 18 07:43:02 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:00:05.0 Off |                  N/A |
| 35%   55C    P2    76W / 250W |   7344MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7293      C   trtserver                                    841MiB |
|    0     18379      C   /home/a.eryomin/anaconda3/bin/python        6493MiB |
+-----------------------------------------------------------------------------+

P.S. Your examples from this repo work: https://github.com/NVIDIA/tensorrt-inference-server

Also, I tried converting from ONNX directly to a TensorRT plan, and I get the error mentioned in this forum issue: https://devtalk.nvidia.com/default/topic/1046422/tensorrt/convert-onnx-to-tensorrt-plain/?offset=2#5309945

Hello,

A model version's ready_state will show up as MODEL_UNAVAILABLE if the model failed to load. trtserver logs to the console as it starts, so you can see it loading the different models in your model repository. Please review the log output for indications of why your model failed to load.
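
For example, one way to capture the full startup log for review is to redirect it to a file (a generic sketch reusing your command from above):

nvidia-docker run --rm -p8000:8000 -p8001:8001 -v/tmp/models:/models \
  nvcr.io/nvidia/tensorrtserver:18.09-py3 trtserver --model-store=/models 2>&1 | tee trtserver.log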

Hi, I created 4 model directories as described here: https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/.
Two of them (netdef and pb) are the same model converted onnx->pb and onnx->netdef. I got this model from https://mxnet.apache.org/versions/master/tutorials/onnx/export_mxnet_to_onnx.html, so the chain mxnet->onnx->caffe2 starts from a standard MXNet model (I also tried running it as a plain MXNet model).
The other two (netdef and pb) are my own model converted the same way.

y.glushenkov@ml-test-env:~$ ls -la /tmp/models_example/
total 32
drwxrwxr-x  8 y.glushenkov y.glushenkov 4096 Jan 19 14:24 .
drwxrwxrwt 11 root         root         4096 Jan 19 14:29 ..
drwxrwxr-x  3 y.glushenkov y.glushenkov 4096 Jan 17 19:12 inception_graphdef
drwxrwxr-x  3 y.glushenkov y.glushenkov 4096 Jan 19 14:31 mxnet_a_standart_model_netdef
drwxrwxr-x  3 y.glushenkov y.glushenkov 4096 Jan 19 14:31 mxnet_a_standart_model_pb
drwxrwxr-x  3 y.glushenkov y.glushenkov 4096 Jan 19 14:32 mxnet_mymodel_netdef
drwxrwxr-x  3 y.glushenkov y.glushenkov 4096 Jan 19 14:31 mxnet_mymodel_pb
drwxrwxr-x  3 y.glushenkov y.glushenkov 4096 Jan 17 19:07 resnet50_netdef

I converted them as described in the "ONNX Models" section of https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/.

I get the following log after running the Docker container:

y.glushenkov@ml-test-env:~$ nvidia-docker run --rm -p8000:8000 -p8001:8001 -v/tmp/models_example:/models nvcr.io/nvidia/tensorrtserver:18.09-py3 trtserver --model-store=/models

===============================
== TensorRT Inference Server ==
===============================

NVIDIA Release 18.09 (build 688039)

Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2018 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for the inference server.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

I0119 14:32:10.345343 1 server.cc:631] Initializing TensorRT Inference Server
I0119 14:32:10.345455 1 server.cc:680] Reporting prometheus metrics on port 8002
I0119 14:32:10.348253 1 metrics.cc:129] found 1 GPUs supported power usage metric
I0119 14:32:10.355767 1 metrics.cc:139]   GPU 0: GeForce GTX 1080 Ti
I0119 14:32:10.356479 1 server.cc:884] Starting server 'inference:0' listening on
I0119 14:32:10.356508 1 server.cc:888]  localhost:8001 for gRPC requests
I0119 14:32:10.357385 1 server.cc:898]  localhost:8000 for HTTP requests
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 235] RAW: Entering the event loop ...
I0119 14:32:10.387799 1 server_core.cc:465] Adding/updating models.
I0119 14:32:10.387837 1 server_core.cc:520]  (Re-)adding model: inception_graphdef
I0119 14:32:10.387858 1 server_core.cc:520]  (Re-)adding model: mxnet_a_standart_model_netdef
I0119 14:32:10.387877 1 server_core.cc:520]  (Re-)adding model: mxnet_a_standart_model_pb
I0119 14:32:10.387900 1 server_core.cc:520]  (Re-)adding model: mxnet_mymodel_netdef
I0119 14:32:10.387922 1 server_core.cc:520]  (Re-)adding model: mxnet_mymodel_pb
I0119 14:32:10.387940 1 server_core.cc:520]  (Re-)adding model: resnet50_netdef
I0119 14:32:10.488538 1 basic_manager.cc:739] Successfully reserved resources to load servable {name: inception_graphdef version: 1}
I0119 14:32:10.488620 1 loader_harness.cc:66] Approving load for servable version {name: inception_graphdef version: 1}
I0119 14:32:10.488686 1 loader_harness.cc:74] Loading servable version {name: inception_graphdef version: 1}
I0119 14:32:10.497107 1 base_bundle.cc:180] Creating instance inception_graphdef_0_0_gpu0 on GPU 0 (6.1) using model.graphdef
I0119 14:32:10.588549 1 basic_manager.cc:739] Successfully reserved resources to load servable {name: mxnet_a_standart_model_netdef version: 1}
I0119 14:32:10.588615 1 loader_harness.cc:66] Approving load for servable version {name: mxnet_a_standart_model_netdef version: 1}
I0119 14:32:10.588656 1 loader_harness.cc:74] Loading servable version {name: mxnet_a_standart_model_netdef version: 1}
I0119 14:32:10.688999 1 basic_manager.cc:739] Successfully reserved resources to load servable {name: resnet50_netdef version: 1}
I0119 14:32:10.689062 1 loader_harness.cc:66] Approving load for servable version {name: resnet50_netdef version: 1}
I0119 14:32:10.689104 1 loader_harness.cc:74] Loading servable version {name: resnet50_netdef version: 1}
E0119 14:32:10.738151 1 retrier.cc:37] Loading servable: {name: mxnet_a_standart_model_netdef version: 1} failed: Not found: /models/mxnet_a_standart_model_netdef/resnet50_labels.txt; No such file or directory
I0119 14:32:10.738240 1 loader_harness.cc:154] Encountered an error for servable version {name: mxnet_a_standart_model_netdef version: 1}: Not found: /models/mxnet_a_standart_model_netdef/resnet50_labels.txt; No such file or directory
E0119 14:32:10.738269 1 aspired_versions_manager.cc:358] Servable {name: mxnet_a_standart_model_netdef version: 1} cannot be loaded: Not found: /models/mxnet_a_standart_model_netdef/resnet50_labels.txt; No such file or directory
I0119 14:32:10.788915 1 basic_manager.cc:739] Successfully reserved resources to load servable {name: mxnet_mymodel_pb version: 1}
I0119 14:32:10.788994 1 loader_harness.cc:66] Approving load for servable version {name: mxnet_mymodel_pb version: 1}
I0119 14:32:10.789065 1 loader_harness.cc:74] Loading servable version {name: mxnet_mymodel_pb version: 1}
I0119 14:32:10.889892 1 basic_manager.cc:739] Successfully reserved resources to load servable {name: mxnet_mymodel_netdef version: 1}
I0119 14:32:10.889948 1 loader_harness.cc:66] Approving load for servable version {name: mxnet_mymodel_netdef version: 1}
I0119 14:32:10.889989 1 loader_harness.cc:74] Loading servable version {name: mxnet_mymodel_netdef version: 1}
I0119 14:32:10.890490 1 cuda_gpu_executor.cc:890] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I0119 14:32:10.895133 1 gpu_device.cc:1405] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:00:05.0
totalMemory: 10.92GiB freeMemory: 4.42GiB
I0119 14:32:10.895185 1 gpu_device.cc:1484] Adding visible gpu devices: 0
I0119 14:32:10.989345 1 basic_manager.cc:739] Successfully reserved resources to load servable {name: mxnet_a_standart_model_pb version: 1}
I0119 14:32:10.989424 1 loader_harness.cc:66] Approving load for servable version {name: mxnet_a_standart_model_pb version: 1}
I0119 14:32:10.989470 1 loader_harness.cc:74] Loading servable version {name: mxnet_a_standart_model_pb version: 1}
E0119 14:32:11.690404 1 retrier.cc:37] Loading servable: {name: mxnet_a_standart_model_pb version: 1} failed: Not found: /models/mxnet_a_standart_model_pb/resnet50_labels.txt; No such file or directory
I0119 14:32:11.690516 1 loader_harness.cc:154] Encountered an error for servable version {name: mxnet_a_standart_model_pb version: 1}: Not found: /models/mxnet_a_standart_model_pb/resnet50_labels.txt; No such file or directory
E0119 14:32:11.690544 1 aspired_versions_manager.cc:358] Servable {name: mxnet_a_standart_model_pb version: 1} cannot be loaded: Not found: /models/mxnet_a_standart_model_pb/resnet50_labels.txt; No such file or directory
E0119 14:32:11.743922 1 retrier.cc:37] Loading servable: {name: mxnet_mymodel_pb version: 1} failed: Not found: /models/mxnet_mymodel_pb/resnet50_labels.txt; No such file or directory
I0119 14:32:11.743998 1 loader_harness.cc:154] Encountered an error for servable version {name: mxnet_mymodel_pb version: 1}: Not found: /models/mxnet_mymodel_pb/resnet50_labels.txt; No such file or directory
E0119 14:32:11.744019 1 aspired_versions_manager.cc:358] Servable {name: mxnet_mymodel_pb version: 1} cannot be loaded: Not found: /models/mxnet_mymodel_pb/resnet50_labels.txt; No such file or directory
I0119 14:32:11.821870 1 netdef_bundle.cc:170] Creating instance resnet50_netdef_0_0_gpu0 on GPU 0 (6.1) using init_model.netdef and model.netdef
E0119 14:32:12.688776 1 retrier.cc:37] Loading servable: {name: mxnet_mymodel_netdef version: 1} failed: Not found: /models/mxnet_mymodel_netdef/resnet50_labels.txt; No such file or directory
I0119 14:32:12.688852 1 loader_harness.cc:154] Encountered an error for servable version {name: mxnet_mymodel_netdef version: 1}: Not found: /models/mxnet_mymodel_netdef/resnet50_labels.txt; No such file or directory
E0119 14:32:12.688893 1 aspired_versions_manager.cc:358] Servable {name: mxnet_mymodel_netdef version: 1} cannot be loaded: Not found: /models/mxnet_mymodel_netdef/resnet50_labels.txt; No such file or directory
W0119 14:32:13.626714 1 init.h:99] Caffe2 GlobalInit should be run before any other API calls.
I0119 14:32:14.104976 1 gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
I0119 14:32:14.105053 1 gpu_device.cc:971]      0 
I0119 14:32:14.105084 1 gpu_device.cc:984] 0:   N 
I0119 14:32:14.106044 1 gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4124 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:00:05.0, compute capability: 6.1)
I0119 14:32:15.959289 1 base_bundle.cc:180] Creating instance inception_graphdef_0_1_gpu0 on GPU 0 (6.1) using model.graphdef
I0119 14:32:15.959389 1 gpu_device.cc:1484] Adding visible gpu devices: 0
I0119 14:32:15.959442 1 gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
I0119 14:32:15.959459 1 gpu_device.cc:971]      0 
I0119 14:32:15.959474 1 gpu_device.cc:984] 0:   N 
I0119 14:32:15.960861 1 gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4124 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:00:05.0, compute capability: 6.1)
I0119 14:32:17.210765 1 base_bundle.cc:180] Creating instance inception_graphdef_0_2_gpu0 on GPU 0 (6.1) using model.graphdef
I0119 14:32:17.210884 1 gpu_device.cc:1484] Adding visible gpu devices: 0
I0119 14:32:17.210921 1 gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
I0119 14:32:17.210939 1 gpu_device.cc:971]      0 
I0119 14:32:17.210955 1 gpu_device.cc:984] 0:   N 
I0119 14:32:17.211488 1 gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4124 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:00:05.0, compute capability: 6.1)
I0119 14:32:18.225171 1 base_bundle.cc:180] Creating instance inception_graphdef_0_3_gpu0 on GPU 0 (6.1) using model.graphdef
I0119 14:32:18.225274 1 gpu_device.cc:1484] Adding visible gpu devices: 0
I0119 14:32:18.225311 1 gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
I0119 14:32:18.225329 1 gpu_device.cc:971]      0 
I0119 14:32:18.225346 1 gpu_device.cc:984] 0:   N 
I0119 14:32:18.225873 1 gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4124 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:00:05.0, compute capability: 6.1)
I0119 14:32:19.583308 1 loader_harness.cc:86] Successfully loaded servable version {name: inception_graphdef version: 1}
W0119 14:32:20.007595 1 init.h:99] Caffe2 GlobalInit should be run before any other API calls.
I0119 14:32:20.613882 1 netdef_bundle.cc:170] Creating instance resnet50_netdef_0_1_gpu0 on GPU 0 (6.1) using init_model.netdef and model.netdef
I0119 14:32:21.323094 1 netdef_bundle.cc:170] Creating instance resnet50_netdef_0_2_gpu0 on GPU 0 (6.1) using init_model.netdef and model.netdef
I0119 14:32:22.080661 1 netdef_bundle.cc:170] Creating instance resnet50_netdef_0_3_gpu0 on GPU 0 (6.1) using init_model.netdef and model.netdef
I0119 14:32:22.778145 1 loader_harness.cc:86] Successfully loaded servable version {name: resnet50_netdef version: 1}

and all of them have the state "ready_state: MODEL_UNAVAILABLE".

P.S. The dir names in the listing above have the prefix "mxnet"; the dirs actually contain models converted mxnet->onnx->caffe2. There aren't any MXNet models; "mxnet" appears only in the directory names.

Oh, I forgot to delete the label_filename: reference in the config files. The offending line in each output block, copied from the blog's resnet50_netdef example (the filename matches the "Not found" errors in the log above), was:

label_filename: "resnet50_labels.txt"

I removed it and now get the following error:

y.glushenkov@ml-test-env:~$ nvidia-docker run --rm -p8000:8000 -p8001:8001 -v/tmp/mymodels_new:/models nvcr.io/nvidia/tensorrtserver:18.09-py3 trtserver --model-store=/models

===============================
== TensorRT Inference Server ==
===============================

NVIDIA Release 18.09 (build 688039)

Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2018 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for the inference server.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

I0119 15:27:25.298543 1 server.cc:631] Initializing TensorRT Inference Server
I0119 15:27:25.298716 1 server.cc:680] Reporting prometheus metrics on port 8002
I0119 15:27:25.301633 1 metrics.cc:129] found 1 GPUs supported power usage metric
I0119 15:27:25.309114 1 metrics.cc:139]   GPU 0: GeForce GTX 1080 Ti
I0119 15:27:25.309730 1 server.cc:884] Starting server 'inference:0' listening on
I0119 15:27:25.309767 1 server.cc:888]  localhost:8001 for gRPC requests
I0119 15:27:25.311020 1 server.cc:898]  localhost:8000 for HTTP requests
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 235] RAW: Entering the event loop ...
I0119 15:27:25.338489 1 server_core.cc:465] Adding/updating models.
I0119 15:27:25.338521 1 server_core.cc:520]  (Re-)adding model: mxnet_a_standart_model_netdef
I0119 15:27:25.439051 1 basic_manager.cc:739] Successfully reserved resources to load servable {name: mxnet_a_standart_model_netdef version: 1}
I0119 15:27:25.439135 1 loader_harness.cc:66] Approving load for servable version {name: mxnet_a_standart_model_netdef version: 1}
I0119 15:27:25.439225 1 loader_harness.cc:74] Loading servable version {name: mxnet_a_standart_model_netdef version: 1}
I0119 15:27:25.590002 1 netdef_bundle.cc:170] Creating instance mxnet_a_standart_model_netdef_0_0_gpu0 on GPU 0 (6.1) using init_model.netdef and model.netdef
W0119 15:27:25.758596 1 init.h:99] Caffe2 GlobalInit should be run before any other API calls.
W0119 15:27:27.910249 1 init.h:99] Caffe2 GlobalInit should be run before any other API calls.
E0119 15:27:28.253048 1 operator_schema.cc:82] Argument 'is_test' is required for Operator 'SpatialBN'.
terminate called after throwing an instance of 'at::Error'
  what():  [enforce fail at operator.cc:135] schema->Verify(operator_def). Operator def did not pass schema checking: input: "data" input: "bn_data_gamma" input: "bn_data_beta" input: "bn_data_moving_mean" input: "bn_data_moving_var" output: "bn_data" name: "bn_data" type: "SpatialBN" arg { name: "epsilon" f: 2e-05 } arg { name: "momentum" f: 0.9 } arg { name: "spatial" i: 0 } device_option { device_type: 1 cuda_gpu_id: 0 }

I had the same error before, when I typed a config by hand. After that I copy-pasted the config from the example and changed the name (dir) in it, but forgot about the label_filename: line.

This issue is still open.
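
In case it is useful, a possible workaround I am considering for the SpatialBN error (an untested sketch; it assumes the converted predict net is a plain Caffe2 NetDef and simply adds the missing is_test argument the schema check asks for):

from caffe2.proto import caffe2_pb2

# Load the converted predict net.
net = caffe2_pb2.NetDef()
with open("predict_net.pb", "rb") as f:
    net.ParseFromString(f.read())

# Add is_test=1 to every SpatialBN op that lacks it, since the
# schema check complains that the argument is required.
for op in net.op:
    if op.type == "SpatialBN" and not any(a.name == "is_test" for a in op.arg):
        arg = op.arg.add()
        arg.name = "is_test"
        arg.i = 1  # run batch norm in inference mode

# Write the patched net back out.
with open("predict_net_fixed.pb", "wb") as f:
    f.write(net.SerializeToString())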