Triton Server Crashing Running CenterNet Keypoints (hourglass_512x512_kpts) on Jetson via Dockerized Triton

The Triton server, running in a container on a Jetson TX2, crashes when serving the CenterNet object detection & keypoints model as a TF2 tensorflow_savedmodel over gRPC, and I’m looking for pointers on how to proceed.

I’m new to Triton/Jetson but after some effort have got an environment that successfully runs Triton examples, namely inception_graphdef and densenet_onnx, via the example grpc_image_client.py and image_client.py python programs.

I have deployed the TF2 hourglass_512x512_kpts model using the default auto-generated config.pbtxt file (see below), with the addition of a max_batch_size of 0. The Triton server loads the model and reports it as ready to serve.
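For reference, a sketch of the model-repository layout Triton expects for a SavedModel (the repository root here is a placeholder; Triton is pointed at it via --model-repository):

```shell
# Sketch of the Triton model-repository layout for a TF2 SavedModel.
# MODEL_REPO is a placeholder path.
MODEL_REPO=${MODEL_REPO:-./model_repo}
mkdir -p "$MODEL_REPO/hourglass_512x512_kpts/1/model.savedmodel"
# config.pbtxt sits alongside the numeric version directories:
touch "$MODEL_REPO/hourglass_512x512_kpts/config.pbtxt"
# The downloaded saved_model.pb and variables/ go inside .../1/model.savedmodel.
```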

I have made minor changes to the grpc_image_client.py program to adapt it to the different dimensionality of the hourglass_512x512_kpts model, and invoked it with verbose logging enabled. The request is received, including the instruction to return only one of the outputs (I’m trying to build up incrementally), but the inference aborts and terminates the container (see selected Triton logs below).
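For context, a hypothetical sketch of such an adapted client using the tritonclient gRPC API: it sends a UINT8 [1, H, W, 3] batch as input_tensor and requests only detection_classes. The URL and request id are placeholders matching the logs below.

```python
import numpy as np


def to_batch(image):
    """Add the leading batch dimension expected by dims [1, -1, -1, 3]."""
    arr = np.asarray(image, dtype=np.uint8)
    return np.expand_dims(arr, axis=0)


def infer(image, url="localhost:8001"):
    # tritonclient is only needed when actually talking to the server
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url=url, verbose=True)
    batch = to_batch(image)
    inp = grpcclient.InferInput("input_tensor", list(batch.shape), "UINT8")
    inp.set_data_from_numpy(batch)
    outputs = [grpcclient.InferRequestedOutput("detection_classes")]
    result = client.infer(
        "hourglass_512x512_kpts",
        inputs=[inp],
        outputs=outputs,
        request_id="my request id",
    )
    return result.as_numpy("detection_classes")
```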

Any pointers as to how to debug this further?

Environment

cat /etc/nv_tegra_release
# R32 (release), REVISION: 5.2, GCID: 27767740, BOARD: t186ref, EABI: aarch64, DATE: Fri Jul  9 16:02:11 UTC 2021

Triton Logs

...
+----------------------------------+--------------------------------------------------------------------------------------+
| Option                           | Value                                                                                |
+----------------------------------+--------------------------------------------------------------------------------------+
| server_id                        | triton                                                                               |
| server_version                   | 2.11.0                                                                               |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedul |
|                                  | e_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_d |
|                                  | ata statistics                                                                       |
| model_repository_path[0]         | /models                                                                              |
| model_control_mode               | MODE_NONE                                                                            |
| strict_model_config              | 0                                                                                    |
| pinned_memory_pool_byte_size     | 268435456                                                                            |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                             |
| min_supported_compute_capability | 5.3                                                                                  |
| strict_readiness                 | 1                                                                                    |
| exit_timeout                     | 30                                                                                   |
+----------------------------------+--------------------------------------------------------------------------------------+
...
I0105 16:43:36.028928 1 grpc_server.cc:3151] Process for ModelInferHandler, rpc_ok=1, 1 step START
I0105 16:43:36.029217 1 grpc_server.cc:3144] New request handler for ModelInferHandler, 4
I0105 16:43:36.029283 1 model_repository_manager.cc:638] GetInferenceBackend() 'hourglass_512x512_kpts' version 1
I0105 16:43:36.029361 1 model_repository_manager.cc:638] GetInferenceBackend() 'hourglass_512x512_kpts' version 1
I0105 16:43:36.029530 1 infer_request.cc:524] prepared: [0x0x7e38093e80] request id: my request id, model: hourglass_512x512_kpts, requested version: 1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7e38094108] input: input_tensor, type: UINT8, original shape: [1,360,640,3], batch + shape: [1,360,640,3], shape: [1,360,640,3]
override inputs:
inputs:
[0x0x7e38094108] input: input_tensor, type: UINT8, original shape: [1,360,640,3], batch + shape: [1,360,640,3], shape: [1,360,640,3]
original requested outputs:
detection_classes
requested outputs:
detection_classes

I0105 16:43:36.029949 1 tensorflow.cc:2390] model hourglass_512x512_kpts, instance hourglass_512x512_kpts, executing 1 requests
I0105 16:43:36.030034 1 tensorflow.cc:1566] TRITONBACKEND_ModelExecute: Running hourglass_512x512_kpts with 1 requests
I0105 16:43:36.031564 1 tensorflow.cc:1816] TRITONBACKEND_ModelExecute: input 'input_tensor' is GPU tensor: false
2022-01-05 16:43:56.007132: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-01-05 16:44:01.398847: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10

The container terminates at this point with no further logging.

config.pbtxt

name: "hourglass_512x512_kpts"
platform: "tensorflow_savedmodel"
max_batch_size: 0
version_policy {
  latest {
    num_versions: 1
  }
}
input {
  name: "input_tensor"
  data_type: TYPE_UINT8
  dims: 1
  dims: -1
  dims: -1
  dims: 3
}
output {
  name: "detection_boxes"
  data_type: TYPE_FP32
  dims: 1
  dims: 100
  dims: 4
}
output {
  name: "num_detections"
  data_type: TYPE_FP32
  dims: 1
}
output {
  name: "detection_keypoints"
  data_type: TYPE_FP32
  dims: 1
  dims: 100
  dims: 17
  dims: 2
}
output {
  name: "detection_classes"
  data_type: TYPE_FP32
  dims: 1
  dims: 100
}
output {
  name: "detection_keypoint_scores"
  data_type: TYPE_FP32
  dims: 1
  dims: 100
  dims: 17
}
output {
  name: "detection_scores"
  data_type: TYPE_FP32
  dims: 1
  dims: 100
}
instance_group {
  name: "hourglass_512x512_kpts"
  count: 1
  gpus: 0
  kind: KIND_GPU
}
default_model_filename: "model.savedmodel"
optimization {
  input_pinned_memory {
    enable: true
  }
  output_pinned_memory {
    enable: true
  }
}
backend: "tensorflow"

Dockerfile Building Triton Server Image

FROM nvcr.io/nvidia/l4t-ml:r32.5.0-py3

ENV TRITON_SERVER_VERSION=2.11.0
ENV JETPACK_VERSION=4.5

ARG DEBIAN_FRONTEND=noninteractive

WORKDIR /tritonserver

RUN wget https://github.com/triton-inference-server/server/releases/download/v${TRITON_SERVER_VERSION}/tritonserver${TRITON_SERVER_VERSION}-jetpack${JETPACK_VERSION}.tgz && \
    tar -xzf tritonserver${TRITON_SERVER_VERSION}-jetpack${JETPACK_VERSION}.tgz && \
    rm tritonserver${TRITON_SERVER_VERSION}-jetpack${JETPACK_VERSION}.tgz

RUN apt-get update -y

RUN apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev && \
    rm -rf /var/lib/apt/lists/*

RUN ln -s /tritonserver /opt/tritonserver 

ENV LD_LIBRARY_PATH=/tritonserver/backends/tensorflow2:$LD_LIBRARY_PATH

ENTRYPOINT ["/tritonserver/bin/tritonserver"]
CMD ["--help"]

Further Observations

I have invoked the Triton model server Docker image with a modified bash entrypoint and tried to run a number of the bundled tests under ./test-util/bin. Most pass, and some fail for obvious reasons (lack of multiple GPUs), but several others fail too, e.g. [ FAILED ] AllocatedMemoryTest.AllocFallback (1 ms). Despite the working example programs, I am not clear whether these test failures point to some issue with my deployed environment or are to be expected on the Jetson.

The same issue persists even after explicitly requesting the TF2 backend with --backend-config=tensorflow,version=2, per Trtsever crashes !! · Issue #1299 · triton-inference-server/server · GitHub.

Incidentally, the SavedModel I am using, freshly downloaded from TensorFlow Hub, runs successfully on TensorFlow Model Server 2.5.1. I have also loaded and re-saved the model in a TensorFlow 2.4 container (the version bundled with the Triton release I am using), which should hopefully strip out anything TensorFlow 2.4 does not understand.

Still looking for bright ideas as to what to try next.

This seems to be a model-specific issue. I downloaded another TF2 object detection w/keypoints model with the same input/output specification from TensorFlow Hub (RetinaNet), and it runs fine with no issues. I am still eager to know how to debug the CenterNet model further, as it is our primary production model right now.

Hi,

Would you mind monitoring the device status at the same time?
TensorFlow may be failing to allocate enough memory for the model, which would lead to this error.

$ sudo tegrastats

Thanks.
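To capture the suggested device status during an inference run, the tegrastats output can be filtered down to just the RAM field. A sketch, assuming the usual `RAM <used>/<total>MB` field in each sample line (the exact line format varies between L4T releases):

```shell
# Extract the RAM usage field from each tegrastats sample line.
ram_field() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "RAM") print $(i + 1) }'
}

# Usage (run alongside the inference request):
#   sudo tegrastats --interval 1000 | ram_field | tee ram.log
```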

I can see that memory usage for these object detection models is close to the capacity of the TX2. With the working RetinaNet model I have been seeing usage of 6.5 GB / 8.0 GB, and indeed, with a second non-TF2 model loaded in Triton (pushing memory higher even when not invoked), I then saw similar inference crashes on the TF2 model, again with no error messages.

When such crashes do occur, I see the used memory being freed, with usage on the unit dropping to 1.3 GB. Even so, I have occasionally seen inference issues after a server restart, as if some resource I’m unaware of is being retained. Are there recommended actions to take after a Triton crash like this to free resources?

My understanding is that the Jetson TX2’s memory is shared by the CPU and GPU. Up until now I have been leaving the memory settings at their defaults, which means 256 MB of pinned memory and a 64 MB CUDA memory pool.
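For reference, those defaults correspond to the pool-size options visible in the server log above, and can be set explicitly on the tritonserver command line, e.g. (values shown are the defaults):

```shell
tritonserver --model-repository=/models \
    --pinned-memory-pool-byte-size=268435456 \
    --cuda-memory-pool-byte-size=0:67108864
```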

This post touches on even tighter constraints on the Jetson Nano, although it doesn’t provide any specific recommendations for Triton server settings vis-à-vis memory limits:

I have not experimented with converting the large object detection models I’m running here, which take UINT8 input tensors, to TensorRT, because (a) I’m not sure the conversion is supported for this type, and (b) one of the main gains seemed to be in going to UINT8, which we already have…

Hi,

Unfortunately, INT8 operations cannot run on the TX2 due to hardware limitations.
Ideally, Triton will use all the available memory on Jetson, but this still depends on the backend you use.

Have you checked whether the memory required by the model can be reduced?
For example, by using a lower batch size or smaller input dimensions?
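Since the config above declares the spatial dims as [1, -1, -1, 3], the input resolution is under the client’s control, so one cheap experiment along these lines is to downsample the image client-side before sending it. A minimal sketch using simple stride-based decimation (purely illustrative; proper resizing would use an image library):

```python
import numpy as np


def downscale(batch, factor=2):
    """Crudely downsample a [1, H, W, 3] UINT8 batch by striding."""
    return np.ascontiguousarray(batch[:, ::factor, ::factor, :])


batch = np.zeros((1, 360, 640, 3), dtype=np.uint8)
small = downscale(batch)
print(small.shape)  # (1, 180, 320, 3)
```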

Thanks.