ROS2 Launch Crash Randomly

tom_grimwood · March 5, 2024, 10:24pm

I am deploying a set of robots utilising the Isaac ROS packages, including isaac_ros_dnn_image_encoder and isaac_ros_tensor_rt.

I am using an NVIDIA Orin Dev Kit 64GB with an M.2 installed.
Model: Jetson AGX Orin Developer Kit - Jetpack 5.1.2 [L4T 35.4.1]

Libraries:
CUDA: 11.4.315
CUDNN: 8.6.0.166
TensorRT: 8.5.2.2
VPI: 2.3.9
Vulkan: 1.3.204
OpenCV: 4.5.4 with CUDA:NO

On launch there are 2 sets of AI image processing/inference containers that spin up.

~30% of the time, a component will crash on launch:

e.g.

1709515396.9951015 [component_container_mt-11] NvMMLiteOpen : Block : BlockType = 261 
1709515397.0986693 [component_container_mt-11] NvMMLiteBlockCreate : Block : BlockType = 261 
1709515397.1022320 [component_container_mt-11] [INFO] [1709515397.101109772] [abc.panorama_server.video_h264_decoder]: [NitrosContext] Running application...
1709515397.1090574 [component_container_mt-11] [INFO] [1709515397.104371090] [abc.panorama_server.video_h264_decoder]: [NitrosNode] Starting a heartbeat timer (eid=17)
1709515397.1104555 [component_container_mt-11] [INFO] [1709515397.104604756] [abc.panorama_server.video_resize_node]: [NitrosContext] Loading application: '/tmp/isaac_ros_nitros/graphs/RUKDNOJEZN/RUKDNOJEZN.yaml'
1709515397.1112237 [component_container_mt-11] [INFO] [1709515397.104717110] [abc.panorama_server.video_dnn_encoder]: [NitrosNode] Initializing NitrosNode
1709515397.1119342 [component_container_mt-11] [INFO] [1709515397.105246908] [abc.panorama_server.video_h264_decoder]: Negotiating
1709515397.1126776 [component_container_mt-11] [INFO] [1709515397.106614124] [abc.panorama_server.video_dnn_encoder]: [NitrosNode] Starting NitrosNode
1709515397.1133666 [component_container_mt-11] [INFO] [1709515397.106669132] [abc.panorama_server.video_dnn_encoder]: [NitrosNode] Loading built-in preset extension specs
1709515397.1140604 [component_container_mt-11] 2024-03-04 14:23:17.108 ERROR gxf/std/type_registry.cpp@48: Unknown type: nvidia::gxf::TensorRtInference
1709515397.1147683 [component_container_mt-11] 2024-03-04 14:23:17.108 ERROR gxf/std/yaml_file_loader.cpp@399: Could not add component of type 'nvidia::gxf::TensorRtInference' to entity.
1709515397.1154776 [component_container_mt-11] [ERROR] [1709515397.108336480] [abc.panorama_server.video_resize_node]: [NitrosNode] LoadApplication Error: GXF_FACTORY_UNKNOWN_CLASS_NAME
1709515397.1166997 [component_container_mt-11] terminate called after throwing an instance of 'std::runtime_error'
1709515397.1174448 [component_container_mt-11]   what():  [NitrosNode] LoadApplication Error: GXF_FACTORY_UNKNOWN_CLASS_NAME
1709515397.2644863 [foxglove_bridge-1] [INFO] [1709515397.261071500] [abc.foxglove_bridge]: Subscribing to topic "/abc/detection_server/rts_image/apriltag_image_annotations" (foxglove_msgs/msg/ImageAnnotations) on channel 36
1709515397.2692885 [foxglove_bridge-1] [INFO] [1709515397.268544739] [abc.foxglove_bridge]: Subscribing to topic "/abc/detection_server/rts_image/bbox_image_annotations" (foxglove_msgs/msg/ImageAnnotations) on channel 35
1709515397.6883087 [detection_server-5] [INFO] [1709515397.687620387] [abc.detection_server]: Initialising Detection Server.
1709515397.6904640 [detection_server-5] [INFO] [1709515397.690162688] [abc.detection_server]: Detection Service Initialised.
1709515397.9373837 [ERROR] [component_container_mt-11]: process has died [pid 29954, exit code -6, cmd '/opt/ros/humble/lib/rclcpp_components/component_container_mt --ros-args -r __node:=tensor_rt_container -r __ns:=/abc/panorama_server'].

or another example:

abc-ai-run  | [component_container_mt-7] [INFO] [1709676265.593694817] [abc.detection_server.rts_image_dnn_encoder]: [NitrosContext] Running application...
abc-ai-run  | [component_container_mt-7] [INFO] [1709676265.595963651] [abc.detection_server.tensor_rt]: [NitrosContext] Loading application: '/tmp/isaac_ros_nitros/graphs/NYTVHSFZKR/NYTVHSFZKR.yaml'
abc-ai-run  | [component_container_mt-7] [INFO] [1709676265.606229519] [abc.detection_server.tensor_rt]: [NitrosNode] Linking Nitros pub/sub to the loaded application
abc-ai-run  | [component_container_mt-7] [ERROR] [1709676265.608126001] [abc.detection_server.tensor_rt]: [NitrosContext] GXFEntityFind Error: GXF_ENTITY_NOT_FOUND
abc-ai-run  | [component_container_mt-7] [ERROR] [1709676265.608557010] [abc.detection_server.tensor_rt]: [NitrosContext] getCid Error: GXF_ENTITY_NOT_FOUND
abc-ai-run  | [component_container_mt-7] [ERROR] [1709676265.608599794] [abc.detection_server.tensor_rt]: [NitrosNode] Failed to get the pointer of nvidia::gxf::DoubleBufferReceiver (inference/rx) for linking a NitrosSubscriber: GXF_ENTITY_NOT_FOUND
abc-ai-run  | [component_container_mt-7] terminate called after throwing an instance of 'std::runtime_error'
abc-ai-run  | [component_container_mt-7]   what():  [NitrosNode] Failed to get the pointer of nvidia::gxf::DoubleBufferReceiver (inference/rx) for linking a NitrosSubscriber: GXF_ENTITY_NOT_FOUND
abc-ai-run  | [detection_server-5] [INFO] [1709676266.443703563] [abc.detection_server]: Initialising Detection Server.
abc-ai-run  | [detection_server-5] [INFO] [1709676266.516133822] [abc.detection_server]: Detection Service Initialised.
abc-ai-run  | [ERROR] [component_container_mt-7]: process has died [pid 29149, exit code -6, cmd '/opt/ros/humble/lib/rclcpp_components/component_container_mt --ros-args -r __node:=tensor_rt_container -r __ns:=/abc/detection_server'].

Simply relaunching the container will eventually allow it to run without errors.
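
Both failure signatures above can be picked out of saved launch output mechanically, which makes it easier to count how often a relaunch actually fails. The following is a minimal sketch using only the Python standard library; `summarise_launch_log` and the regular expressions are hypothetical helpers written for this thread, not part of ROS 2 or Isaac ROS, and the sample lines are copied from the second log above.

```python
import re

# GXF result codes as they appear in the logs, e.g. GXF_ENTITY_NOT_FOUND.
GXF_CODE = re.compile(r"GXF_[A-Z_]+")
# The line ros2 launch prints when a child process exits abnormally.
PROCESS_DIED = re.compile(
    r"\[ERROR\] \[(?P<proc>[\w-]+)\]: process has died "
    r"\[pid \d+, exit code (?P<code>-?\d+)"
)


def summarise_launch_log(text):
    """Collect distinct GXF error codes and dead processes from launch output."""
    codes = set()
    died = []
    for line in text.splitlines():
        codes.update(GXF_CODE.findall(line))
        match = PROCESS_DIED.search(line)
        if match:
            died.append((match["proc"], int(match["code"])))
    # GXF_SUCCESS is not a failure, so drop it if it ever shows up.
    codes.discard("GXF_SUCCESS")
    return {"gxf_codes": sorted(codes), "died": died}


sample = "\n".join([
    "[component_container_mt-7] [ERROR] [1709676265.608126001] "
    "[abc.detection_server.tensor_rt]: [NitrosContext] GXFEntityFind Error: "
    "GXF_ENTITY_NOT_FOUND",
    "[ERROR] [component_container_mt-7]: process has died "
    "[pid 29149, exit code -6, cmd '/opt/ros/humble/lib/rclcpp_components/"
    "component_container_mt --ros-args -r __node:=tensor_rt_container "
    "-r __ns:=/abc/detection_server'].",
])
print(summarise_launch_log(sample))
```

A negative exit code means the process was terminated by that signal number; -6 is SIGABRT, which matches the `terminate called` line that precedes it in both logs.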

I am launching via a Docker Compose file, using an image based on nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble:

FROM nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_b7e1ed6c02a6fa3c1c7392479291c035

...
...

RUN apt-get update && apt-get install -y \
    ros-humble-isaac-ros-common \
    ros-humble-isaac-ros-dnn-image-encoder \
    ros-humble-isaac-ros-tensor-rt \
    ros-humble-isaac-ros-h264-decoder \
    ros-humble-isaac-ros-image-pipeline \
    ros-humble-isaac-ros-nitros

The following volumes are mounted into the container:

    volumes:
      - ${HOME}/.Xauthority:/home/admin/.Xauthority:rw
      - /dev/*:/dev/*
      - /etc/localtime:/etc/localtime:ro
      - /usr/bin/tegrastats:/usr/bin/tegrastats
      - /tmp/argus_socket:/tmp/argus_socket
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusolver.so.11:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusolver.so.11
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusparse.so.11:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusparse.so.11
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcurand.so.10:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcurand.so.10
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libnvToolsExt.so:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libnvToolsExt.so
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcupti.so.11.4:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcupti.so.11.4
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcudla.so.1:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcudla.so.1
      - /usr/local/cuda-11.4/targets/aarch64-linux/include/nvToolsExt.h:/usr/local/cuda-11.4/targets/aarch64-linux/include/nvToolsExt.h
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcufft.so.10:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcufft.so.10
      - /usr/lib/aarch64-linux-gnu/tegra:/usr/lib/aarch64-linux-gnu/tegra
      - /usr/src/jetson_multimedia_api:/usr/src/jetson_multimedia_api
      - /opt/nvidia/nsight-systems-cli:/opt/nvidia/nsight-systems-cli
      - /opt/nvidia/vpi2:/opt/nvidia/vpi2
      - /usr/share/vpi2:/usr/share/vpi2

A snippet from the launch file:

    h264_decoder = ComposableNode(
        name="video_h264_decoder",
        package="isaac_ros_h264_decoder",
        namespace=[LaunchConfiguration("ns"), "/panorama_server"],
        plugin="nvidia::isaac_ros::h264_decoder::DecoderNode",
        parameters=[
            {
                "input_height": 1080,
                "input_width": 1920,
            }
        ],
        remappings=[
            (
                "image_compressed",
                ["/", LaunchConfiguration("ns"), "/", VIDEO_INPUT_TOPIC, "/", "h264"],
            ),
            (
                "image_uncompressed",
                ["/", LaunchConfiguration("ns"), "/", VIDEO_INPUT_TOPIC],
            ),
        ],
    )

    image_encoder_node = ComposableNode(
        name="video_dnn_encoder",
        namespace=[LaunchConfiguration("ns"), "/panorama_server"],
        package="isaac_ros_dnn_image_encoder",
        plugin="nvidia::isaac_ros::dnn_inference::DnnImageEncoderNode",
        parameters=[
            {
                "input_image_width" : model_dimension_width,
                "input_image_height" : model_dimension_height,
                "network_image_width": model_dimension_width,
                "network_image_height": model_dimension_height,
                "image_mean": [0.0, 0.0, 0.0],
                "image_stddev": [
                    PIXEL_SCALE_INVERSE,
                    PIXEL_SCALE_INVERSE,
                    PIXEL_SCALE_INVERSE,
                ],
            }
        ],
        remappings=[
            ("encoded_tensor", "tensor_pub"),
            ("image", ["/", LaunchConfiguration("ns"), "/", VIDEO_INPUT_TOPIC]),
        ],
    )

    image_resize_node = ComposableNode(
        name="video_resize_node",
        namespace=[LaunchConfiguration("ns"), "/panorama_server"],
        package="isaac_ros_image_proc",
        plugin="nvidia::isaac_ros::image_proc::ResizeNode",
        parameters=[
            {
                "output_height" : model_dimension_height,
                "output_width" : model_dimension_width,
                "keep_aspect_ratio": False,
            }
        ],
        remappings=[
            ("image", ["/", LaunchConfiguration("ns"), "/", VIDEO_INPUT_TOPIC]),
            ("camera_info", ["/", LaunchConfiguration("ns"), "/camera_info"]),
            ("resize/image", ["/", LaunchConfiguration("ns"), "/", VIDEO_INPUT_TOPIC+"_resized"]),
            ("resize/camera_info", ["/", LaunchConfiguration("ns"), "/camera_info"+"_resized"]),
        ],
    )

    tensorrt_inference_node = ComposableNode(
        name="tensor_rt",
        namespace=[LaunchConfiguration("ns"), "/panorama_server"],
        package="isaac_ros_tensor_rt",
        plugin="nvidia::isaac_ros::dnn_inference::TensorRTNode",
        parameters=[
            {
                "engine_file_path": model_engine_file_path,
                "output_binding_names": [
                    "num_detections",
                    "detection_boxes",
                    "detection_scores",
                    "detection_classes",
                ],
                "output_tensor_names": [
                    "num_detections",
                    "detection_boxes",
                    "detection_scores",
                    "detection_classes",
                ],
                "input_tensor_names": ["input_tensor"],
                "input_binding_names": ["input"],
                "force_engine_update": False,
            }
        ],
    )

    video_inference_container = ComposableNodeContainer(
        name="tensor_rt_container",
        namespace=[LaunchConfiguration("ns"), "/panorama_server"],
        package="rclcpp_components",
        executable="component_container_mt",
        # The h264 image is received. It is then h264 decoded, resized, tensor encoded, and then inferenced.
        composable_node_descriptions=[
            h264_decoder,
            image_resize_node,
            image_encoder_node,
            tensorrt_inference_node,
        ],
    )

Raffaello · March 6, 2024, 6:51pm

Hi @tom_grimwood

You have built a custom Docker container, but looking at your configuration and logs, I don’t see any relevant bug or error.
Please, if you can, share part of your Dockerfile so we can figure out where this crash is coming from.

Raffaello

tom_grimwood · March 6, 2024, 9:03pm

Hey Raffaello, here is the full Dockerfile.

FROM nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_b7e1ed6c02a6fa3c1c7392479291c035

# Setup non-root admin user
ARG USERNAME
ARG USER_UID=1000
ARG USER_GID=1000

# disable terminal interaction for apt
ENV DEBIAN_FRONTEND=noninteractive
ENV SHELL=/bin/bash
SHELL ["/bin/bash", "-c"]

# Env setup
RUN locale-gen en_US en_US.UTF-8
RUN update-locale LC_ALL=en_US.UTF-8 LANG=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV ROS_PYTHON_VERSION=3
ENV ROS_DISTRO=humble
ENV ROS_ROOT=/opt/ros/${ROS_DISTRO}

# Install Isaac ROS packages
RUN apt-get update && apt-get install -y \
    ros-humble-isaac-ros-common \
    ros-humble-isaac-ros-dnn-image-encoder \
    ros-humble-isaac-ros-tensor-rt \
    ros-humble-isaac-ros-h264-decoder \
    ros-humble-isaac-ros-image-pipeline \
    ros-humble-isaac-ros-nitros \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

# Install foxglove deps.
RUN apt-get update && apt-get install -y \
        ros-humble-foxglove-msgs \
        ros-humble-foxglove-bridge \
        libwebsocketpp-dev \
        ros-humble-ament-cmake-clang-format \
        ros-humble-resource-retriever \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean


RUN pip3 install transforms3d

# a dependency
RUN apt-get update && apt-get install -y \
        ros-humble-tf-transformations \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean
    


# Install AprilTag and mcap deps.
RUN apt-get update && apt-get install -y \
        ros-humble-apriltag-msgs \
        ros-humble-apriltag \
        ros-humble-rosbag2-storage-mcap \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean


# Copy source files
COPY ai/src /workspaces/isaac_ros-dev/src


# Build non Isaac Ros packages from source
RUN apt-get update \
    && source ${ROS_ROOT}/setup.bash && cd /workspaces/isaac_ros-dev/src \
    && cd /workspaces/isaac_ros-dev \
    && rosdep install -y -r --ignore-src --from-paths src --rosdistro ${ROS_DISTRO} \
    && colcon build --merge-install --cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    && rm -Rf src build log \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

# Install prerequisites
RUN apt-get update && apt-get install -y \
        sudo \
        udev \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

# Reuse triton-server user as 'admin' user if exists
RUN if [ $(getent group triton-server) ]; then \
        groupmod -o --gid ${USER_GID} -n ${USERNAME} triton-server ; \
        usermod -l ${USERNAME} -u ${USER_UID} -m -d /home/${USERNAME} triton-server ; \
        mkdir -p /home/${USERNAME} ; \
        sudo chown ${USERNAME}:${USERNAME} /home/${USERNAME} ; \
    fi

# Create the 'admin' user if not already exists
RUN if [ ! $(getent passwd ${USERNAME}) ]; then \
        groupadd --gid ${USER_GID} ${USERNAME} ; \
        useradd --uid ${USER_UID} --gid ${USER_GID} -m ${USERNAME} ; \
    fi

# Update 'admin' user
RUN echo ${USERNAME} ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/${USERNAME} \
    && chmod 0440 /etc/sudoers.d/${USERNAME} \
    && adduser ${USERNAME} video && adduser ${USERNAME} plugdev && adduser ${USERNAME} sudo


RUN mkdir -p /usr/local/share/middleware_profiles
RUN mkdir -p /usr/local/share/mcap_profiles
COPY middleware_profiles/*profile.xml /usr/local/share/middleware_profiles/
COPY mcap-config.yaml /usr/local/share/mcap_profiles/
ENV USERNAME=${USERNAME}
ENV USER_GID=${USER_GID}
ENV USER_UID=${USER_UID}

Here is the Docker Compose file:

version: '3.9'
services:

  ros_humble:
    build:
      dockerfile: docker/humble.jetson.Dockerfile
      args:
        USERNAME: admin
    image: ai_run
    runtime: nvidia
    container_name: ros_humble
    privileged: true
    network_mode: host
    pid: "host"
    ipc: host
    user: admin
    stdin_open: true
    tty: true
    working_dir: /workspaces/isaac_ros-dev
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all

    volumes:
      - ${HOME}/.Xauthority:/home/admin/.Xauthority:rw
      - /dev/*:/dev/*
      - /etc/localtime:/etc/localtime:ro
      - /usr/bin/tegrastats:/usr/bin/tegrastats
      - /tmp/argus_socket:/tmp/argus_socket
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusolver.so.11:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusolver.so.11
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusparse.so.11:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcusparse.so.11
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcurand.so.10:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcurand.so.10
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libnvToolsExt.so:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libnvToolsExt.so
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcupti.so.11.4:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcupti.so.11.4
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcudla.so.1:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcudla.so.1
      - /usr/local/cuda-11.4/targets/aarch64-linux/include/nvToolsExt.h:/usr/local/cuda-11.4/targets/aarch64-linux/include/nvToolsExt.h
      - /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcufft.so.10:/usr/local/cuda-11.4/targets/aarch64-linux/lib/libcufft.so.10
      - /usr/lib/aarch64-linux-gnu/tegra:/usr/lib/aarch64-linux-gnu/tegra
      - /usr/src/jetson_multimedia_api:/usr/src/jetson_multimedia_api
      - /opt/nvidia/nsight-systems-cli:/opt/nvidia/nsight-systems-cli
      - /opt/nvidia/vpi2:/opt/nvidia/vpi2
      - /usr/share/vpi2:/usr/share/vpi2

I will also try it on a different Orin today and see if the issue persists.

tom_grimwood · March 7, 2024, 10:44pm

I think I have fixed the issue. My guess is that the cause was launching two sets of composable nodes at the same time, each containing many Isaac ROS nodes. I have now added a small delay between launching the two sets, and the crashes have stopped.

I guessed this might be a fix because only one set of the AI nodes would ever crash on a given launch, never both, which possibly suggests the two sets were interfering with each other.
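
One way to test the interference guess would be to compare when each container loads its NITROS graphs and see whether the two time windows overlap. This is only a sketch under assumptions: `load_windows` and `windows_overlap` are hypothetical helpers, the pattern is written against the `[NitrosContext] Loading application` lines shown in the logs above, and an overlap would be a hint rather than proof of a race.

```python
import re
from collections import defaultdict

# Matches e.g.
# [component_container_mt-7] [INFO] [1709676265.595963651] [...]: [NitrosContext] Loading application
LOAD = re.compile(
    r"\[(?P<container>component_container_mt-\d+)\] \[INFO\] "
    r"\[(?P<stamp>\d+\.\d+)\] \[[^\]]+\]: \[NitrosContext\] Loading application"
)


def load_windows(log_text):
    """Return {container: (first_load_time, last_load_time)} in seconds."""
    stamps = defaultdict(list)
    for match in LOAD.finditer(log_text):
        stamps[match["container"]].append(float(match["stamp"]))
    return {name: (min(ts), max(ts)) for name, ts in stamps.items()}


def windows_overlap(a, b):
    """True if two (start, end) windows share at least one instant."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Running `load_windows` on the combined output of a single launch, with and without the delay, would show whether the delay actually separates the two loading phases.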

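import os

from ament_index_python.packages import get_package_share_directory
from launch import LaunchDescription
from launch.actions import IncludeLaunchDescription, TimerAction
from launch.launch_description_sources import PythonLaunchDescriptionSource


def generate_launch_description():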

    ai_stuff_1 = IncludeLaunchDescription(
        PythonLaunchDescriptionSource(
            [
                os.path.join(
                    get_package_share_directory("ai_1"), "launch"
                ),
                "/ai_1.launch.py",
            ]
        )
    )

    ai_stuff_2 = IncludeLaunchDescription(
        PythonLaunchDescriptionSource(
            [
                os.path.join(get_package_share_directory("ai_2"), "launch"),
                "/ai_2.launch.py",
            ]
        )
    )

    # Use TimerAction to stagger the two launches: ai_stuff_1 starts almost
    # immediately (0.1 s) and ai_stuff_2 starts 5 seconds after launch.
    delayed_ai_stuff_1 = TimerAction(
        period=0.1,
        actions=[ai_stuff_1],
    )

    delayed_ai_stuff_2 = TimerAction(
        period=5.0,
        actions=[ai_stuff_2],
    )

    return LaunchDescription(
        [
             delayed_ai_stuff_1,
             delayed_ai_stuff_2
        ]
    )
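
If more than two launch files ever need staggering, the `period` values could be computed rather than hard-coded. A minimal sketch, assuming the same 0.1 s first delay and 5 s spacing as above; `stagger_periods` is a hypothetical helper, and each returned value would be passed as `period=` to a `TimerAction`.

```python
from typing import Dict, Sequence


def stagger_periods(
    groups: Sequence[str], first: float = 0.1, gap: float = 5.0
) -> Dict[str, float]:
    """Map each launch group name to a start delay in seconds.

    The first group starts after `first` seconds; group i (0-based) starts
    after i * gap seconds, but never earlier than `first`.
    """
    if gap <= 0:
        raise ValueError("gap must be positive")
    return {name: max(first, index * gap) for index, name in enumerate(groups)}


print(stagger_periods(["ai_1", "ai_2"]))  # {'ai_1': 0.1, 'ai_2': 5.0}
```

A fixed delay only makes it less likely that the two startups overlap. The 5 second value is the one that worked in this thread and may need tuning on slower or more heavily loaded systems.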
