ROS2 Launch Crashes Randomly

I am deploying a set of robots utilising the Isaac ROS packages, including isaac_ros_dnn_image_encoder and isaac_ros_tensor_rt.

I am using an Nvidia Orin Dev kit 64GB with an m.2 installed
Model: Jetson AGX Orin Developer Kit - Jetpack 5.1.2 [L4T 35.4.1]

Libraries:
CUDA: 11.4.315
CUDNN: 8.6.0.166
TensorRT: 8.5.2.2
VPI: 2.3.9
Vulkan: 1.3.204
OpenCV: 4.5.4 with CUDA:NO

On launch, two sets of AI image processing/inference containers spin up.

~30% of the time, a component will crash on launch:

e.g.

1709515396.9951015 [component_container_mt-11] NvMMLiteOpen : Block : BlockType = 261 
1709515397.0986693 [component_container_mt-11] NvMMLiteBlockCreate : Block : BlockType = 261 
1709515397.1022320 [component_container_mt-11] [INFO] [1709515397.101109772] [abc.panorama_server.video_h264_decoder]: [NitrosContext] Running application...
1709515397.1090574 [component_container_mt-11] [INFO] [1709515397.104371090] [abc.panorama_server.video_h264_decoder]: [NitrosNode] Starting a heartbeat timer (eid=17)
1709515397.1104555 [component_container_mt-11] [INFO] [1709515397.104604756] [abc.panorama_server.video_resize_node]: [NitrosContext] Loading application: '/tmp/isaac_ros_nitros/graphs/RUKDNOJEZN/RUKDNOJEZN.yaml'
1709515397.1112237 [component_container_mt-11] [INFO] [1709515397.104717110] [abc.panorama_server.video_dnn_encoder]: [NitrosNode] Initializing NitrosNode
1709515397.1119342 [component_container_mt-11] [INFO] [1709515397.105246908] [abc.panorama_server.video_h264_decoder]: Negotiating
1709515397.1126776 [component_container_mt-11] [INFO] [1709515397.106614124] [abc.panorama_server.video_dnn_encoder]: [NitrosNode] Starting NitrosNode
1709515397.1133666 [component_container_mt-11] [INFO] [1709515397.106669132] [abc.panorama_server.video_dnn_encoder]: [NitrosNode] Loading built-in preset extension specs
1709515397.1140604 [component_container_mt-11] 2024-03-04 14:23:17.108 ERROR gxf/std/type_registry.cpp@48: Unknown type: nvidia::gxf::TensorRtInference
1709515397.1147683 [component_container_mt-11] 2024-03-04 14:23:17.108 ERROR gxf/std/yaml_file_loader.cpp@399: Could not add component of type 'nvidia::gxf::TensorRtInference' to entity.
1709515397.1154776 [component_container_mt-11] [ERROR] [1709515397.108336480] [abc.panorama_server.video_resize_node]: [NitrosNode] LoadApplication Error: GXF_FACTORY_UNKNOWN_CLASS_NAME
1709515397.1166997 [component_container_mt-11] terminate called after throwing an instance of 'std::runtime_error'
1709515397.1174448 [component_container_mt-11]   what():  [NitrosNode] LoadApplication Error: GXF_FACTORY_UNKNOWN_CLASS_NAME
1709515397.2644863 [foxglove_bridge-1] [INFO] [1709515397.261071500] [abc.foxglove_bridge]: Subscribing to topic "/abc/detection_server/rts_image/apriltag_image_annotations" (foxglove_msgs/msg/ImageAnnotations) on channel 36
1709515397.2692885 [foxglove_bridge-1] [INFO] [1709515397.268544739] [abc.foxglove_bridge]: Subscribing to topic "/abc/detection_server/rts_image/bbox_image_annotations" (foxglove_msgs/msg/ImageAnnotations) on channel 35
1709515397.6883087 [detection_server-5] [INFO] [1709515397.687620387] [abc.detection_server]: Initialising Detection Server.
1709515397.6904640 [detection_server-5] [INFO] [1709515397.690162688] [abc.detection_server]: Detection Service Initialised.
1709515397.9373837 [ERROR] [component_container_mt-11]: process has died [pid 29954, exit code -6, cmd '/opt/ros/humble/lib/rclcpp_components/component_container_mt --ros-args -r __node:=tensor_rt_container -r __ns:=/abc/panorama_server'].

Or another example:

abc-ai-run  | [component_container_mt-7] [INFO] [1709676265.593694817] [abc.detection_server.rts_image_dnn_encoder]: [NitrosContext] Running application...
abc-ai-run  | [component_container_mt-7] [INFO] [1709676265.595963651] [abc.detection_server.tensor_rt]: [NitrosContext] Loading application: '/tmp/isaac_ros_nitros/graphs/NYTVHSFZKR/NYTVHSFZKR.yaml'
abc-ai-run  | [component_container_mt-7] [INFO] [1709676265.606229519] [abc.detection_server.tensor_rt]: [NitrosNode] Linking Nitros pub/sub to the loaded application
abc-ai-run  | [component_container_mt-7] [ERROR] [1709676265.608126001] [abc.detection_server.tensor_rt]: [NitrosContext] GXFEntityFind Error: GXF_ENTITY_NOT_FOUND
abc-ai-run  | [component_container_mt-7] [ERROR] [1709676265.608557010] [abc.detection_server.tensor_rt]: [NitrosContext] getCid Error: GXF_ENTITY_NOT_FOUND
abc-ai-run  | [component_container_mt-7] [ERROR] [1709676265.608599794] [abc.detection_server.tensor_rt]: [NitrosNode] Failed to get the pointer of nvidia::gxf::DoubleBufferReceiver (inference/rx) for linking a NitrosSubscriber: GXF_ENTITY_NOT_FOUND
abc-ai-run  | [component_container_mt-7] terminate called after throwing an instance of 'std::runtime_error'
abc-ai-run  | [component_container_mt-7]   what():  [NitrosNode] Failed to get the pointer of nvidia::gxf::DoubleBufferReceiver (inference/rx) for linking a NitrosSubscriber: GXF_ENTITY_NOT_FOUND
abc-ai-run  | [detection_server-5] [INFO] [1709676266.443703563] [abc.detection_server]: Initialising Detection Server.
abc-ai-run  | [detection_server-5] [INFO] [1709676266.516133822] [abc.detection_server]: Detection Service Initialised.
abc-ai-run  | [ERROR] [component_container_mt-7]: process has died [pid 29149, exit code -6, cmd '/opt/ros/humble/lib/rclcpp_components/component_container_mt --ros-args -r __node:=tensor_rt_container -r __ns:=/abc/detection_server'].
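
To compare crash signatures across many launches, the GXF error codes can be tallied per process. This is a minimal sketch, assuming the log lines keep the `[process-N]` prefix shown in the excerpts above; `summarize_gxf_errors` is a hypothetical helper written for this post, not part of Isaac ROS or ros2 launch.

```python
import re
from collections import Counter

# Matches GXF result codes such as GXF_FACTORY_UNKNOWN_CLASS_NAME.
GXF_CODE = re.compile(r"\bGXF_[A-Z_]+\b")
# Matches the process label that ros2 launch prefixes to each line,
# e.g. [component_container_mt-11].
PROCESS = re.compile(r"\[([A-Za-z0-9_]+-\d+)\]")


def summarize_gxf_errors(lines):
    """Count (process, GXF code) pairs on lines that report an error."""
    counts = Counter()
    for line in lines:
        # Only look at error lines and the exception message line.
        if "ERROR" not in line and "what():" not in line:
            continue
        match = PROCESS.search(line)
        process = match.group(1) if match else "unknown"
        for code in GXF_CODE.findall(line):
            counts[(process, code)] += 1
    return counts


# Lines shortened from the two crash logs above.
sample = [
    "[component_container_mt-11] [ERROR] [1709515397.108336480] [abc.panorama_server.video_resize_node]: [NitrosNode] LoadApplication Error: GXF_FACTORY_UNKNOWN_CLASS_NAME",
    "[component_container_mt-7] [ERROR] [1709676265.608126001] [abc.detection_server.tensor_rt]: [NitrosContext] GXFEntityFind Error: GXF_ENTITY_NOT_FOUND",
    "[component_container_mt-7] [ERROR] [1709676265.608557010] [abc.detection_server.tensor_rt]: [NitrosContext] getCid Error: GXF_ENTITY_NOT_FOUND",
    "[detection_server-5] [INFO] [1709676266.443703563] [abc.detection_server]: Initialising Detection Server.",
]

for (process, code), count in summarize_gxf_errors(sample).items():
    print(process, code, count)
```

Running this over saved launch logs should show whether the two failure modes always come from the same container or alternate between the two sets.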

Simply relaunching the container will eventually allow it to run without errors.
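
For a rough sense of what "eventually" means here, the retry arithmetic can be sketched as follows. It assumes each launch crashes independently with the ~30% rate quoted above, which is an assumption and may not hold if the crashes are correlated; `launch_retry_stats` is a hypothetical helper.

```python
def launch_retry_stats(p_crash, attempts):
    """Return (expected launches until a clean start,
    probability that `attempts` launches in a row all crash)."""
    if not 0.0 <= p_crash < 1.0:
        raise ValueError("p_crash must be in [0, 1)")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # Mean of a geometric distribution with success probability (1 - p_crash).
    expected_launches = 1.0 / (1.0 - p_crash)
    # Independent crashes multiply.
    p_all_crash = p_crash ** attempts
    return expected_launches, p_all_crash


expected, p_three_crashes = launch_retry_stats(0.3, 3)
print(f"expected launches: {expected:.2f}")
print(f"three crashes in a row: {p_three_crashes:.1%}")
```

Under that assumption a clean start usually arrives within a couple of attempts, which matches the relaunch behaviour described, but it is a workaround rather than a fix.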

I am launching via a Docker Compose file, with an image based on nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble:

FROM nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_b7e1ed6c02a6fa3c1c7392479291c035

...
...

RUN apt-get update && apt-get install -y \
    ros-humble-isaac-ros-common \
    ros-humble-isaac-ros-dnn-image-encoder \
    ros-humble-isaac-ros-tensor-rt \
    ros-humble-isaac-ros-h264-decoder \
    ros-humble-isaac-ros-image-pipeline \
    ros-humble-isaac-ros-nitros \

The following volumes are mounted into the container:

    volumes:
      - ${HOME}/.Xauthority:/home/admin/.Xauthority:rw
      - /dev/*:/dev/*
      - /etc/localtime:/etc/localtime:ro
      - /usr/bin/tegrastats:/usr/bin/tegrastats
      - /tmp/argus_socket:/tmp/argus_socket
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusolver.so.11:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusolver.so.11
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusparse.so.11:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusparse.so.11
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcurand.so.10:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcurand.so.10
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libnvToolsExt.so:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libnvToolsExt.so
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcupti.so.11.4:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcupti.so.11.4
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcudla.so.1:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcudla.so.1
      - /usr/local/cuda-11.4/targets/aarch64-linux/include/nvToolsExt.h:/usr/local/cuda-11.4/targets/aarch64-linux/include/nvToolsExt.h
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcufft.so.10:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcufft.so.10
      - /usr/lib/aarch64-linux-gnu/tegra:/usr/lib/aarch64-linux-gnu/tegra
      - /usr/src/jetson_multimedia_api:/usr/src/jetson_multimedia_api
      - /opt/nvidia/nsight-systems-cli:/opt/nvidia/nsight-systems-cli
      - /opt/nvidia/vpi2:/opt/nvidia/vpi2
      - /usr/share/vpi2:/usr/share/vpi2

A snippet from the launch file:

    h264_decoder = ComposableNode(
        name="video_h264_decoder",
        package="isaac_ros_h264_decoder",
        namespace=[LaunchConfiguration("ns"), "/panorama_server"],
        plugin="nvidia::isaac_ros::h264_decoder::DecoderNode",
        parameters=[
            {
                "input_height": 1080,
                "input_width": 1920,
            }
        ],
        remappings=[
            (
                "image_compressed",
                ["/", LaunchConfiguration("ns"), "/", VIDEO_INPUT_TOPIC, "/", "h264"],
            ),
            (
                "image_uncompressed",
                ["/", LaunchConfiguration("ns"), "/", VIDEO_INPUT_TOPIC],
            ),
        ],
    )

    image_encoder_node = ComposableNode(
        name="video_dnn_encoder",
        namespace=[LaunchConfiguration("ns"), "/panorama_server"],
        package="isaac_ros_dnn_image_encoder",
        plugin="nvidia::isaac_ros::dnn_inference::DnnImageEncoderNode",
        parameters=[
            {
                "input_image_width": model_dimension_width,
                "input_image_height": model_dimension_height,
                "network_image_width": model_dimension_width,
                "network_image_height": model_dimension_height,
                "image_mean": [0.0, 0.0, 0.0],
                "image_stddev": [
                    PIXEL_SCALE_INVERSE,
                    PIXEL_SCALE_INVERSE,
                    PIXEL_SCALE_INVERSE,
                ],
            }
        ],
        remappings=[
            ("encoded_tensor", "tensor_pub"),
            ("image", ["/", LaunchConfiguration("ns"), "/", VIDEO_INPUT_TOPIC]),
        ],
    )

    image_resize_node = ComposableNode(
        name="video_resize_node",
        namespace=[LaunchConfiguration("ns"), "/panorama_server"],
        package="isaac_ros_image_proc",
        plugin="nvidia::isaac_ros::image_proc::ResizeNode",
        parameters=[
            {
                "output_height": model_dimension_height,
                "output_width": model_dimension_width,
                "keep_aspect_ratio": False,
            }
        ],
        remappings=[
            ("image", ["/", LaunchConfiguration("ns"), "/", VIDEO_INPUT_TOPIC]),
            ("camera_info", ["/", LaunchConfiguration("ns"), "/camera_info"]),
            ("resize/image", ["/", LaunchConfiguration("ns"), "/", VIDEO_INPUT_TOPIC+"_resized"]),
            ("resize/camera_info", ["/", LaunchConfiguration("ns"), "/camera_info"+"_resized"]),
        ],
    )

    tensorrt_inference_node = ComposableNode(
        name="tensor_rt",
        namespace=[LaunchConfiguration("ns"), "/panorama_server"],
        package="isaac_ros_tensor_rt",
        plugin="nvidia::isaac_ros::dnn_inference::TensorRTNode",
        parameters=[
            {
                "engine_file_path": model_engine_file_path,
                "output_binding_names": [
                    "num_detections",
                    "detection_boxes",
                    "detection_scores",
                    "detection_classes",
                ],
                "output_tensor_names": [
                    "num_detections",
                    "detection_boxes",
                    "detection_scores",
                    "detection_classes",
                ],
                "input_tensor_names": ["input_tensor"],
                "input_binding_names": ["input"],
                "force_engine_update": False,
            }
        ],
    )

    video_inference_container = ComposableNodeContainer(
        name="tensor_rt_container",
        namespace=[LaunchConfiguration("ns"), "/panorama_server"],
        package="rclcpp_components",
        executable="component_container_mt",
        # The h264 image is received. It is then h264 decoded, resized, tensor encoded, and then inferenced.
        composable_node_descriptions=[
            h264_decoder,
            image_resize_node,
            image_encoder_node,
            tensorrt_inference_node,
        ],
    )

Hi @tom_grimwood

You designed a new Docker container, but looking at your configuration and logs, I don’t see any relevant bug or error.
Please, if you can, share part of your Dockerfile so we can figure out where this crash is coming from.

Raffaello

Hey Raffaello, here is the full Dockerfile.

FROM nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_b7e1ed6c02a6fa3c1c7392479291c035

# Setup non-root admin user
ARG USERNAME
ARG USER_UID=1000
ARG USER_GID=1000

# disable terminal interaction for apt
ENV DEBIAN_FRONTEND=noninteractive
ENV SHELL=/bin/bash
SHELL ["/bin/bash", "-c"]

# Env setup
RUN locale-gen en_US en_US.UTF-8
RUN update-locale LC_ALL=en_US.UTF-8 LANG=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV ROS_PYTHON_VERSION=3
ENV ROS_DISTRO=humble
ENV ROS_ROOT=/opt/ros/${ROS_DISTRO}

# Install Isaac ROS packages
RUN apt-get update && apt-get install -y \
    ros-humble-isaac-ros-common \
    ros-humble-isaac-ros-dnn-image-encoder \
    ros-humble-isaac-ros-tensor-rt \
    ros-humble-isaac-ros-h264-decoder \
    ros-humble-isaac-ros-image-pipeline \
    ros-humble-isaac-ros-nitros \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

# Install foxglove deps.
RUN apt-get update && apt-get install -y \
        ros-humble-foxglove-msgs \
        ros-humble-foxglove-bridge \
        libwebsocketpp-dev \
        ros-humble-ament-cmake-clang-format \
        ros-humble-resource-retriever \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean


RUN pip3 install transforms3d

# a dependency
RUN apt-get update && apt-get install -y \
        ros-humble-tf-transformations \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean
    


# Install AprilTag and mcap deps.
RUN apt-get update && apt-get install -y \
        ros-humble-apriltag-msgs \
        ros-humble-apriltag \
        ros-humble-rosbag2-storage-mcap \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean


# Copy source files
COPY ai/src /workspaces/isaac_ros-dev/src


# Build non Isaac Ros packages from source
RUN apt-get update \
    && source ${ROS_ROOT}/setup.bash && cd /workspaces/isaac_ros-dev/src \
    && cd /workspaces/isaac_ros-dev \
    && rosdep install -y -r --ignore-src --from-paths src --rosdistro ${ROS_DISTRO} \
    && colcon build --merge-install --cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    && rm -Rf src build log \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

# Install prerequisites
RUN apt-get update && apt-get install -y \
        sudo \
        udev \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

# Reuse triton-server user as 'admin' user if exists
RUN if [ $(getent group triton-server) ]; then \
        groupmod -o --gid ${USER_GID} -n ${USERNAME} triton-server ; \
        usermod -l ${USERNAME} -u ${USER_UID} -m -d /home/${USERNAME} triton-server ; \
        mkdir -p /home/${USERNAME} ; \
        sudo chown ${USERNAME}:${USERNAME} /home/${USERNAME} ; \
    fi

# Create the 'admin' user if not already exists
RUN if [ ! $(getent passwd ${USERNAME}) ]; then \
        groupadd --gid ${USER_GID} ${USERNAME} ; \
        useradd --uid ${USER_UID} --gid ${USER_GID} -m ${USERNAME} ; \
    fi

# Update 'admin' user
RUN echo ${USERNAME} ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/${USERNAME} \
    && chmod 0440 /etc/sudoers.d/${USERNAME} \
    && adduser ${USERNAME} video && adduser ${USERNAME} plugdev && adduser ${USERNAME} sudo


RUN mkdir -p /usr/local/share/middleware_profiles
RUN mkdir -p /usr/local/share/mcap_profiles
COPY middleware_profiles/*profile.xml /usr/local/share/middleware_profiles/
COPY mcap-config.yaml /usr/local/share/mcap_profiles/
ENV USERNAME=${USERNAME}
ENV USER_GID=${USER_GID}
ENV USER_UID=${USER_UID}

Here is the Docker Compose file:

version: '3.9'
services:

  ros_humble:
    build:
      dockerfile: docker/humble.jetson.Dockerfile
      args:
        USERNAME: admin
    image: ai_run
    runtime: nvidia
    container_name: ros_humble
    privileged: true
    network_mode: host
    pid: "host"
    ipc: host
    user: admin
    stdin_open: true
    tty: true
    working_dir: /workspaces/isaac_ros-dev
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all

    volumes:
      - ${HOME}/.Xauthority:/home/admin/.Xauthority:rw
      - /dev/*:/dev/*
      - /etc/localtime:/etc/localtime:ro
      - /usr/bin/tegrastats:/usr/bin/tegrastats
      - /tmp/argus_socket:/tmp/argus_socket
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusolver.so.11:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusolver.so.11
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusparse.so.11:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusparse.so.11
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcurand.so.10:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcurand.so.10
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libnvToolsExt.so:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libnvToolsExt.so
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcupti.so.11.4:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcupti.so.11.4
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcudla.so.1:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcudla.so.1
      - /usr/local/cuda-11.4/targets/aarch64-linux/include/nvToolsExt.h:/usr/local/cuda-11.4/targets/aarch64-linux/include/nvToolsExt.h
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcufft.so.10:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcufft.so.10
      - /usr/lib/aarch64-linux-gnu/tegra:/usr/lib/aarch64-linux-gnu/tegra
      - /usr/src/jetson_multimedia_api:/usr/src/jetson_multimedia_api
      - /opt/nvidia/nsight-systems-cli:/opt/nvidia/nsight-systems-cli
      - /opt/nvidia/vpi2:/opt/nvidia/vpi2
      - /usr/share/vpi2:/usr/share/vpi2

I will also try it on a different Orin today and see if the issue persists.

I think I have fixed the issue. I was launching two sets of composable nodes at the same time, each containing many Isaac ROS nodes. I have now added a small delay between launching the two sets, and the crashes have stopped.

I guessed this might be the fix because only one set of the AI nodes would ever crash, never both, which suggests they were interfering with each other during startup.

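import os

from ament_index_python.packages import get_package_share_directory
from launch import LaunchDescription
from launch.actions import IncludeLaunchDescription, TimerAction
from launch.launch_description_sources import PythonLaunchDescriptionSource


def generate_launch_description():

    # First set of AI nodes.
    ai_stuff_1 = IncludeLaunchDescription(
        PythonLaunchDescriptionSource(
            os.path.join(
                get_package_share_directory("ai_1"), "launch", "ai_1.launch.py"
            )
        )
    )

    # Second set of AI nodes.
    ai_stuff_2 = IncludeLaunchDescription(
        PythonLaunchDescriptionSource(
            os.path.join(
                get_package_share_directory("ai_2"), "launch", "ai_2.launch.py"
            )
        )
    )

    # Start the first set almost immediately (0.1 s after launch).
    delayed_ai_stuff_1 = TimerAction(
        period=0.1,
        actions=[ai_stuff_1],
    )

    # Delay the second set by 5 seconds so the two sets do not
    # initialise at the same time.
    delayed_ai_stuff_2 = TimerAction(
        period=5.0,
        actions=[ai_stuff_2],
    )

    return LaunchDescription(
        [
            delayed_ai_stuff_1,
            delayed_ai_stuff_2,
        ]
    )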
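
If more than two sets of nodes ever need the same treatment, the delays can be generated rather than hard-coded. This is a minimal sketch in plain Python; `stagger_periods` is a hypothetical helper, and the 5-second gap is simply the value that worked for me above, not a documented requirement. Each resulting value would be passed as the `period` argument of a `TimerAction`.

```python
def stagger_periods(group_names, gap_s=5.0, first_s=0.0):
    """Map each launch group to a start delay, spaced gap_s seconds apart."""
    if gap_s < 0 or first_s < 0:
        raise ValueError("delays must be non-negative")
    return {
        name: first_s + index * gap_s
        for index, name in enumerate(group_names)
    }


periods = stagger_periods(["ai_1", "ai_2"])
for name, period in periods.items():
    # In the launch file: TimerAction(period=period, actions=[<include for name>])
    print(f"{name}: start after {period} s")
```

A fixed delay only hides a startup race, so the gap may need tuning on slower storage or with larger engine files.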