Why doesn't the Kafka broker always detect that the server is down?

Hi All,

I’m using Jetson Xavier NX (?) with DS 6.0.0 installed.

My app is based on deepstream-app and is configured to send messages to a Kafka server.
It works fine, and the only change I wanted to make was to run it as a service that starts on boot.
I accomplished that by creating a systemd unit for my app:

[Unit]
Description=start app as a service on boot
After=nvds_logger.service

[Service]
ExecStart=/home/user/project/app/bin/systemd/start_app_as_service.sh
Restart=always
RestartSec=60

[Install]
WantedBy=multi-user.target

start_app_as_service.sh is just a simple script:

#!/bin/bash

APP_DIR=/home/user/project/app
GST_DEBUG=ERROR $APP_DIR/bin/app -c $APP_DIR/config/app.cfg >> $APP_DIR/log 2>&1

When my Kafka server is up and running and the Jetson reboots, the app starts normally, and I can see it communicating with the server by checking sudo systemctl status app.service.
If I then shut down the Kafka server, I get error messages in the nvds_logger file and also in sudo systemctl status app.service.
So far so good.
However, if my Kafka server is down and I reboot the Jetson, the app starts normally: there are no errors in the nvds_logger file, and sudo systemctl status app.service shows no errors either. That's pretty weird; I expected the app to fail because the server is not up.
By trial and error I managed to get what I want by adding a delay to the [Service] section of my app’s unit:

ExecStartPre=/bin/sleep 30
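A fixed sleep works, but it only guesses how long boot takes. An alternative sketch (the host, port, and script path are placeholders, not from my setup) is to have ExecStartPre= run a small script that polls the Kafka endpoint until it accepts TCP connections or a timeout expires:

```shell
#!/bin/bash
# wait_for_kafka.sh -- poll a TCP endpoint until it accepts connections
# or a timeout expires. Host and port below are placeholders.

wait_for_port() {
  local host="$1" port="$2" timeout="${3:-30}"
  local deadline=$((SECONDS + timeout))
  while (( SECONDS < deadline )); do
    # bash's /dev/tcp pseudo-device attempts a real TCP connect;
    # the subshell exits non-zero if the connection fails
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Example: wait up to 30 s for the (placeholder) Kafka server, then give up
# wait_for_port kafka.example.com 9092 30
```

With something like this as ExecStartPre=, the unit would delay the app only as long as actually needed, rather than a fixed 30 seconds.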

So it looks like, for some reason, when the app is started at boot time, the broker is unable to detect that there is no connection to the server (perhaps no network connection at all). Maybe something else has not started yet?

Can anyone more familiar with the matter comment on it?
I’m happy that my config is working, but either a) something is wrong with my approach, or b) there is a bug in the Kafka broker code.
In any case I’d love to know more about the issue.

P.S. I take it that I need to call setup_nvds_logger.sh after every reboot, and I do so using the same systemd technique. What I don’t quite understand is why errors like “Kafka broker is down” appear in the nvds_logger file, but on the screen I only get this:

ERROR from sink_sub_bin_sink1: Could not configure supporting library.

Debug info: /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvmsgbroker/gstnvmsgbroker.cpp(402): legacy_gst_nvmsgbroker_start (): /GstPipeline:pipeline/GstBin:sink_sub_bin1/GstNvMsgBroker:sink_sub_bin_sink1:

unable to connect to broker library

ERROR from sink_sub_bin_sink1: GStreamer error: state change failed and some element failed to post a proper error message with the reason for the failure.

To me, the error message “unable to connect to broker library” sounds misleading, because I know for a fact that the actual cause is being unable to connect to the server, which is quite different (and that is exactly what it writes to the nvds_logger file, and only there).

Ideally I would like to have all warnings/errors produced by the app in one place, and currently I’m trying to achieve that by redirecting its output to a file:

./bin/app -c config/app.cfg >> log 2>&1

but it lacks the error messages that appear in the nvds_logger file.
What should I change to achieve my goal?
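One low-tech way to get everything into a single file, sketched below, is to merge the two logs after the fact. The nvds_logger path depends on how setup_nvds_logger.sh configured rsyslog on your system, so treat it as a placeholder:

```shell
#!/bin/bash
# combine_logs: append several log files into one, keeping per-file headers.

combine_logs() {
  local out="$1"; shift
  # With more than one input file, tail prints a "==> file <==" header
  # before each file's contents, so the combined log still shows the source.
  tail -n +1 "$@" >> "$out"
}

# Example (both input paths are placeholders for this setup):
# combine_logs combined.log /home/user/project/app/log /tmp/nvds/ds.log
```

For a live view, `tail -F` on both files gives the same interleaving as they grow.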

We support a reconnect policy.
Please enable it in the config.
To enable: auto-reconnect=1
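For reference, a sketch of where these settings might go, based only on the option names mentioned in this thread (verify the file and group names against the DS 6.0 nvmsgbroker documentation before relying on them):

```ini
# Sketch -- option names taken from the replies in this thread; consult the
# DS 6.0 nvmsgbroker docs for the authoritative file and group names.

# In the deepstream-app config (sink section):
[sink1]
new-api=1                         # use the nv_msgbroker interface
msg-broker-config=cfg_kafka.txt   # adapter config file (path is a placeholder)

# In cfg_kafka.txt:
[message-broker]
auto-reconnect=1                  # keep retrying after the connection drops
```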

Ok, let me clarify something.
According to the documentation, the reconnection feature is part of the nv_msgbroker interface, which has to be enabled by setting new-api to 1, and I didn’t do that.
On top of that, as I described in my initial post, the application is able to detect that the server is down when the system is up and running, but fails to do so while the system is starting.

So your suggestion is how to change my configuration to prevent it from failing when the server/connection is down, but my goal is to find out how to make the current mode of operation consistent, i.e. whenever there is no connection to the server, the application terminates (which suits me better).

By the way, what happens to generated messages while the connection to the server is down? Are they buffered somehow? Do we have any control over that?
Ideally I don’t want to lose any messages, so I was wondering if there is a way to configure the broker to keep incoming messages while the connection is down for up to, say, 15 minutes, and to discard the oldest messages as new ones arrive if the connection stays down beyond that limit.

Did you run setup_nvds_logger.sh before your app service started after the reboot?
I think the logger did not catch your app’s status.

We do not support that. Since the broker adapter is open source, you can customize the plugin to suit your needs: for example, poll the connection callback and, on a failed status, add the message to a buffer for later processing.

# connection max retry limit, in seconds
max-retry-limit=value

Try this.
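Combined with the earlier auto-reconnect suggestion, the adapter config might then look like this sketch (the 60-second value is an arbitrary example; check the group name and exact semantics against the nvmsgbroker docs):

```ini
# Sketch -- values are examples, not recommendations
[message-broker]
auto-reconnect=1     # retry when the connection drops
max-retry-limit=60   # give up after 60 seconds of retries
```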

Yes I did.
My point is that if I manually start my app as a service when the system is completely booted and the Kafka server is down, it exits with an error code and produces error messages in the log and on standard output.
However, when it is started automatically by the system during startup, it keeps running, and I can see it repeatedly calling the do_work method.

Well, I eventually found a way to start my app as a service on boot so that it detects the status of the connection to the Kafka server.
The key is to configure the unit to wait until the system reaches a state where everything the app needs is running, by adding multi-user.target as another After= dependency:

[Unit]
Description=start app as a service on boot
Wants=network-online.target
Wants=nvds_logger.service
After=nvds_logger.service
After=network-online.target
After=multi-user.target

[Service]
ExecStart=/home/user/project/app/bin/systemd/start_app_as_service.sh
Restart=always
RestartSec=60

[Install]
WantedBy=multi-user.target

The unit that describes nvds_logger looks almost the same.
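One caveat about the unit above: network-online.target only reflects real network readiness if the wait-online service matching your network stack is enabled; otherwise the target can be reached before connectivity actually exists. As a sketch (check which network manager your Jetson image actually uses):

```shell
# Enable the wait-online service so network-online.target is meaningful.
# Use the variant matching your network stack:
sudo systemctl enable NetworkManager-wait-online.service
# or, on systemd-networkd systems:
# sudo systemctl enable systemd-networkd-wait-online.service
```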

The bottom line is that the Kafka broker code does not detect some situations in which a connection to the server cannot be established, and assumes it is connected, which is bad.
I’m happy that in my case there is a workaround, and I’ll leave fixing the Kafka code to the NVIDIA team.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.